I'm getting a java.lang.OutOfMemoryError: Java heap space even with GSON Streaming.
{"result":"OK","base64":"JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC...."}
The base64 value can be up to 200 MB long, yet GSON uses far more memory than that (around 3 GB). When I try to store the base64 in a variable I get:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuilder.append(StringBuilder.java:204)
at com.google.gson.stream.JsonReader.nextQuotedValue(JsonReader.java:1014)
at com.google.gson.stream.JsonReader.nextString(JsonReader.java:815)
What is the best way to handle this kind of field?
The reason you're getting an OutOfMemoryError is that GSON's nextString() aggregates the whole literal into one huge String via a StringBuilder. When you face an issue like this, you have to deal with the intermediate data somehow, and unfortunately GSON does not let you process huge literals incrementally.
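For illustration, here is a minimal sketch of the obvious streaming attempt (in stands for your response InputStream, and the base64 field name is taken from your sample payload). It is exactly the nextString() call that blows up, because the whole literal is materialized as one String before you ever see it:
// Plain GSON streaming: still fails on the huge field.
try ( final JsonReader reader = new JsonReader(new InputStreamReader(in, StandardCharsets.UTF_8)) ) {
    reader.beginObject();
    while ( reader.hasNext() ) {
        if ( "base64".equals(reader.nextName()) ) {
            final String base64 = reader.nextString(); // <-- OutOfMemoryError is thrown from here
            // ... decode base64 ...
        } else {
            reader.skipValue();
        }
    }
    reader.endObject();
}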
Not sure if you can change the response payload, but if you can't, you might want to implement your own JSON reader, or "hack" the existing JsonReader to make it work in a truly streaming fashion. The example below is based on GSON 2.5 and makes heavy use of reflection, because JsonReader hides its state very carefully.
EnhancedGson25JsonReader.java
final class EnhancedGson25JsonReader
extends JsonReader {
// A listener to accept the internal character buffers.
// Accepting a single string built on such buffers is total memory waste as well.
interface ISlicedStringListener {
void accept(char[] buffer, int start, int length)
throws IOException;
}
// The constants can be just copied
/** @see JsonReader#PEEKED_NONE */
private static final int PEEKED_NONE = 0;
/** @see JsonReader#PEEKED_SINGLE_QUOTED */
private static final int PEEKED_SINGLE_QUOTED = 8;
/** @see JsonReader#PEEKED_DOUBLE_QUOTED */
private static final int PEEKED_DOUBLE_QUOTED = 9;
// Here is a bunch of spies used to "spy" on the parent class's state
private final FieldSpy<Integer> peeked;
private final MethodSpy<Integer> doPeek;
private final MethodSpy<Integer> getLineNumber;
private final MethodSpy<Integer> getColumnNumber;
private final FieldSpy<char[]> buffer;
private final FieldSpy<Integer> pos;
private final FieldSpy<Integer> limit;
private final MethodSpy<Character> readEscapeCharacter;
private final FieldSpy<Integer> lineNumber;
private final FieldSpy<Integer> lineStart;
private final MethodSpy<Boolean> fillBuffer;
private final MethodSpy<IOException> syntaxError;
private final FieldSpy<Integer> stackSize;
private final FieldSpy<int[]> pathIndices;
private EnhancedGson25JsonReader(final Reader reader)
throws NoSuchFieldException, NoSuchMethodException {
super(reader);
peeked = spyField(JsonReader.class, this, "peeked");
doPeek = spyMethod(JsonReader.class, this, "doPeek");
getLineNumber = spyMethod(JsonReader.class, this, "getLineNumber");
getColumnNumber = spyMethod(JsonReader.class, this, "getColumnNumber");
buffer = spyField(JsonReader.class, this, "buffer");
pos = spyField(JsonReader.class, this, "pos");
limit = spyField(JsonReader.class, this, "limit");
readEscapeCharacter = spyMethod(JsonReader.class, this, "readEscapeCharacter");
lineNumber = spyField(JsonReader.class, this, "lineNumber");
lineStart = spyField(JsonReader.class, this, "lineStart");
fillBuffer = spyMethod(JsonReader.class, this, "fillBuffer", int.class);
syntaxError = spyMethod(JsonReader.class, this, "syntaxError", String.class);
stackSize = spyField(JsonReader.class, this, "stackSize");
pathIndices = spyField(JsonReader.class, this, "pathIndices");
}
static EnhancedGson25JsonReader getEnhancedGson25JsonReader(final Reader reader) {
try {
return new EnhancedGson25JsonReader(reader);
} catch ( final NoSuchFieldException | NoSuchMethodException ex ) {
throw new RuntimeException(ex);
}
}
// This method has been copied and reworked from the nextString() implementation
void nextSlicedString(final ISlicedStringListener listener)
throws IOException {
int p = peeked.get();
if ( p == PEEKED_NONE ) {
p = doPeek.get();
}
switch ( p ) {
case PEEKED_SINGLE_QUOTED:
nextQuotedSlicedValue('\'', listener);
break;
case PEEKED_DOUBLE_QUOTED:
nextQuotedSlicedValue('"', listener);
break;
default:
throw new IllegalStateException("Expected a string but was " + peek()
+ " at line " + getLineNumber.get()
+ " column " + getColumnNumber.get()
+ " path " + getPath()
);
}
peeked.accept(PEEKED_NONE);
pathIndices.get()[stackSize.get() - 1]++;
}
// The following method is also a copy-paste that was patched to use the "spies".
// It is, in principle, the same as the original one, but it has one extra buffer, singleCharBuffer,
// so that ISlicedStringListener does not need another method (enjoy lambdas as much as possible).
// Note that the main difference between the two methods is that this one
// does not aggregate a single string value, but simply delegates the internal
// buffers to the call sites, so the latter can do anything they like with the buffers.
/**
 * @see JsonReader#nextQuotedValue(char)
 */
private void nextQuotedSlicedValue(final char quote, final ISlicedStringListener listener)
throws IOException {
final char[] buffer = this.buffer.get();
final char[] singleCharBuffer = new char[1];
while ( true ) {
int p = pos.get();
int l = limit.get();
int start = p;
while ( p < l ) {
final int c = buffer[p++];
if ( c == quote ) {
pos.accept(p);
listener.accept(buffer, start, p - start - 1);
return;
} else if ( c == '\\' ) {
pos.accept(p);
listener.accept(buffer, start, p - start - 1);
singleCharBuffer[0] = readEscapeCharacter.get();
listener.accept(singleCharBuffer, 0, 1);
p = pos.get();
l = limit.get();
start = p;
} else if ( c == '\n' ) {
lineNumber.accept(lineNumber.get() + 1);
lineStart.accept(p);
}
}
listener.accept(buffer, start, p - start);
pos.accept(p);
if ( !fillBuffer.apply(just1) ) {
throw syntaxError.apply(justUnterminatedString);
}
}
}
// Save some memory
private static final Object[] just1 = { 1 };
private static final Object[] justUnterminatedString = { "Unterminated string" };
}
FieldSpy.java
final class FieldSpy<T>
implements Supplier<T>, Consumer<T> {
private final Object instance;
private final Field field;
private FieldSpy(final Object instance, final Field field) {
this.instance = instance;
this.field = field;
}
static <T> FieldSpy<T> spyField(final Class<?> declaringClass, final Object instance, final String fieldName)
throws NoSuchFieldException {
final Field field = declaringClass.getDeclaredField(fieldName);
field.setAccessible(true);
return new FieldSpy<>(instance, field);
}
@Override
public T get() {
try {
@SuppressWarnings("unchecked")
final T value = (T) field.get(instance);
return value;
} catch ( final IllegalAccessException ex ) {
throw new RuntimeException(ex);
}
}
@Override
public void accept(final T value) {
try {
field.set(instance, value);
} catch ( final IllegalAccessException ex ) {
throw new RuntimeException(ex);
}
}
}
MethodSpy.java
final class MethodSpy<T>
implements Function<Object[], T>, Supplier<T> {
private static final Object[] emptyObjectArray = {};
private final Object instance;
private final Method method;
private MethodSpy(final Object instance, final Method method) {
this.instance = instance;
this.method = method;
}
static <T> MethodSpy<T> spyMethod(final Class<?> declaringClass, final Object instance, final String methodName, final Class<?>... parameterTypes)
throws NoSuchMethodException {
final Method method = declaringClass.getDeclaredMethod(methodName, parameterTypes);
method.setAccessible(true);
return new MethodSpy<>(instance, method);
}
@Override
public T get() {
// my javac generates a useless new Object[0] if no args are passed
return apply(emptyObjectArray);
}
@Override
public T apply(final Object[] arguments) {
try {
@SuppressWarnings("unchecked")
final T value = (T) method.invoke(instance, arguments);
return value;
} catch ( final IllegalAccessException | InvocationTargetException ex ) {
throw new RuntimeException(ex);
}
}
}
HugeJsonReaderDemo.java
And here is a demo that uses that method to read a huge JSON document and redirect its string values to another file.
public static void main(final String... args)
throws IOException {
try ( final EnhancedGson25JsonReader input = getEnhancedGson25JsonReader(new InputStreamReader(new FileInputStream("./huge.json")));
final Writer output = new OutputStreamWriter(new BufferedOutputStream(new FileOutputStream("./huge.json.STRINGS"))) ) {
while ( input.hasNext() ) {
final JsonToken token = input.peek();
switch ( token ) {
case BEGIN_OBJECT:
input.beginObject();
break;
case NAME:
input.nextName();
break;
case STRING:
input.nextSlicedString(output::write);
break;
default:
throw new AssertionError(token);
}
}
}
}
I successfully extracted the fields above into a file. The input file was 544 MB (570 425 371 bytes) long and was generated out of the following JSON chunks:
{"result":"OK","base64":"
JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC × 16777216 (2^24)
"}
And the result is (since I just redirect all strings to the file):
OK
JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC × 16777216 (2^24)
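If your end goal is the decoded PDF rather than the raw base64 text, the same listener can feed a Writer that sits on top of a decoding stream. Below is an untested sketch: it assumes Apache Commons Codec is on the classpath (its Base64OutputStream decodes whatever is written to it when constructed with doEncode = false), and the ./huge.pdf output name is made up.
import org.apache.commons.codec.binary.Base64OutputStream;
public static void main(final String... args)
        throws IOException {
    try ( final EnhancedGson25JsonReader input = getEnhancedGson25JsonReader(new InputStreamReader(new FileInputStream("./huge.json"), StandardCharsets.US_ASCII));
            // doEncode = false turns Base64OutputStream into a decoder
            final Writer decoded = new OutputStreamWriter(new Base64OutputStream(new BufferedOutputStream(new FileOutputStream("./huge.pdf")), false), StandardCharsets.US_ASCII) ) {
        input.beginObject();
        while ( input.hasNext() ) {
            if ( "base64".equals(input.nextName()) ) {
                input.nextSlicedString(decoded::write);
            } else {
                input.skipValue();
            }
        }
        input.endObject();
    }
}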
I think you have run into a very interesting issue. It would be nice to have some feedback from the GSON team on a possible API enhancement.
Related
I have set up an Apache HBase, Nutch and Hadoop cluster. I have crawled some documents, about 30 million. There are 3 workers in the cluster and 1 master. I have written my own HBase mapreduce job to read the crawled data and adjust some scores slightly based on some logic.
For this purpose, I combined the documents of the same domain, found their effective bytes, and computed a score. Later, in the reducer, I assigned that score to each URL of that domain (via a cache). This portion of the job takes too much time, i.e., 16 hours. Following is the code snippet:
for ( int index = 0; index < Cache.size(); index++) {
String Orig_key = Cache.get(index);
float doc_score = log10;
WebPage page = datastore.get(Orig_key);
if ( page == null ) {
continue;
}
page.setScore(doc_score);
if (mark) {
page.getMarkers().put( Queue, Q1);
}
context.write(Orig_key, page);
}
If I remove that document read statement from the datastore, the job finishes in only 2 to 3 hours. That's why I think the statement WebPage page = datastore.get(Orig_key); is causing this problem. Isn't it?
If that is the case, then what is the best approach? The Cache object is simply a list that contains the URLs of the same domain.
DomainAnalysisJob.java
...
...
public class DomainAnalysisJob implements Tool {
public static final Logger LOG = LoggerFactory
.getLogger(DomainAnalysisJob.class);
private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
private Configuration conf;
protected static final Utf8 URL_ORIG_KEY = new Utf8("doc_orig_id");
protected static final Utf8 DOC_DUMMY_MARKER = new Utf8("doc_marker");
protected static final Utf8 DUMMY_KEY = new Utf8("doc_id");
protected static final Utf8 DOMAIN_DUMMY_MARKER = new Utf8("domain_marker");
protected static final Utf8 LINK_MARKER = new Utf8("link");
protected static final Utf8 Queue = new Utf8("q");
private static URLNormalizers urlNormalizers;
private static URLFilters filters;
private static int maxURL_Length;
static {
FIELDS.add(WebPage.Field.STATUS);
FIELDS.add(WebPage.Field.LANG_INFO);
FIELDS.add(WebPage.Field.URDU_SCORE);
FIELDS.add(WebPage.Field.MARKERS);
FIELDS.add(WebPage.Field.INLINKS);
}
/**
* Maps each WebPage to a host key.
*/
public static class Mapper extends GoraMapper<String, WebPage, Text, WebPage> {
@Override
protected void setup(Context context) throws IOException ,InterruptedException {
Configuration conf = context.getConfiguration();
urlNormalizers = new URLNormalizers(context.getConfiguration(), URLNormalizers.SCOPE_DEFAULT);
filters = new URLFilters(context.getConfiguration());
maxURL_Length = conf.getInt("url.characters.max.length", 2000);
}
@Override
protected void map(String key, WebPage page, Context context)
throws IOException, InterruptedException {
String reversedHost = null;
if (page == null) {
return;
}
if ( key.length() > maxURL_Length ) {
return;
}
String url = null;
try {
url = TableUtil.unreverseUrl(key);
url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
url = filters.filter(url); // filter the url
} catch (Exception e) {
LOG.warn("Skipping " + key + ":" + e);
return;
}
if ( url == null) {
context.getCounter("DomainAnalysis", "FilteredURL").increment(1);
return;
}
try {
reversedHost = TableUtil.getReversedHost(key.toString());
}
catch (Exception e) {
return;
}
page.getMarkers().put( URL_ORIG_KEY, new Utf8(key) );
context.write( new Text(reversedHost), page );
}
}
public DomainAnalysisJob() {
}
public DomainAnalysisJob(Configuration conf) {
setConf(conf);
}
@Override
public Configuration getConf() {
return conf;
}
@Override
public void setConf(Configuration conf) {
this.conf = conf;
}
public void updateDomains(boolean buildLinkDb, int numTasks) throws Exception {
NutchJob job = NutchJob.getInstance(getConf(), "rankDomain-update");
job.getConfiguration().setInt("mapreduce.task.timeout", 1800000);
if ( numTasks < 1) {
job.setNumReduceTasks(job.getConfiguration().getInt(
"mapred.map.tasks", job.getNumReduceTasks()));
} else {
job.setNumReduceTasks(numTasks);
}
ScoringFilters scoringFilters = new ScoringFilters(getConf());
HashSet<WebPage.Field> fields = new HashSet<WebPage.Field>(FIELDS);
fields.addAll(scoringFilters.getFields());
StorageUtils.initMapperJob(job, fields, Text.class, WebPage.class,
Mapper.class);
StorageUtils.initReducerJob(job, DomainAnalysisReducer.class);
job.waitForCompletion(true);
}
@Override
public int run(String[] args) throws Exception {
boolean linkDb = false;
int numTasks = -1;
for (int i = 0; i < args.length; i++) {
if ("-rankDomain".equals(args[i])) {
linkDb = true;
} else if ("-crawlId".equals(args[i])) {
getConf().set(Nutch.CRAWL_ID_KEY, args[++i]);
} else if ("-numTasks".equals(args[i]) ) {
numTasks = Integer.parseInt(args[++i]);
}
else {
throw new IllegalArgumentException("unrecognized arg " + args[i]
+ " usage: updatedomain -crawlId <crawlId> [-numTasks N]" );
}
}
LOG.info("Updating DomainRank:");
updateDomains(linkDb, numTasks);
return 0;
}
public static void main(String[] args) throws Exception {
final int res = ToolRunner.run(NutchConfiguration.create(),
new DomainAnalysisJob(), args);
System.exit(res);
}
}
DomainAnalysisReducer.java
...
...
public class DomainAnalysisReducer extends
GoraReducer<Text, WebPage, String, WebPage> {
public static final Logger LOG = DomainAnalysisJob.LOG;
public DataStore<String, WebPage> datastore;
protected static float q1_ur_threshold = 500.0f;
protected static float q1_ur_docCount = 50;
public static final Utf8 Queue = new Utf8("q"); // Markers for Q1 and Q2
public static final Utf8 Q1 = new Utf8("q1");
public static final Utf8 Q2 = new Utf8("q2");
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
Configuration conf = context.getConfiguration();
try {
datastore = StorageUtils.createWebStore(conf, String.class, WebPage.class);
}
catch (ClassNotFoundException e) {
throw new IOException(e);
}
q1_ur_threshold = conf.getFloat("domain.queue.threshold.bytes", 500.0f);
q1_ur_docCount = conf.getInt("domain.queue.doc.count", 50);
LOG.info("Conf updated: Queue-bytes-threshold = " + q1_ur_threshold + " Queue-doc-threshold: " + q1_ur_docCount);
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
datastore.close();
}
@Override
protected void reduce(Text key, Iterable<WebPage> values, Context context)
throws IOException, InterruptedException {
ArrayList<String> Cache = new ArrayList<String>();
int doc_counter = 0;
int total_ur_bytes = 0;
for ( WebPage page : values ) {
// cache
String orig_key = page.getMarkers().get( DomainAnalysisJob.URL_ORIG_KEY ).toString();
Cache.add(orig_key);
// do not consider those doc's that are not fetched or link URLs
if ( page.getStatus() == CrawlStatus.STATUS_UNFETCHED ) {
continue;
}
doc_counter++;
int ur_score_int = 0;
int doc_ur_bytes = 0;
int doc_total_bytes = 0;
String ur_score_str = "0";
String langInfo_str = null;
// read page and find its Urdu score
langInfo_str = TableUtil.toString(page.getLangInfo());
if (langInfo_str == null) {
continue;
}
ur_score_str = TableUtil.toString(page.getUrduScore());
ur_score_int = Integer.parseInt(ur_score_str);
doc_total_bytes = Integer.parseInt( langInfo_str.split("&")[0] );
doc_ur_bytes = ( doc_total_bytes * ur_score_int) / 100; //Formula to find ur percentage
total_ur_bytes += doc_ur_bytes;
}
float avg_bytes = 0;
float log10 = 0;
if ( doc_counter > 0 && total_ur_bytes > 0) {
avg_bytes = (float) total_ur_bytes/doc_counter;
log10 = (float) Math.log10(avg_bytes);
log10 = (Math.round(log10 * 100000f)/100000f);
}
context.getCounter("DomainAnalysis", "DomainCount").increment(1);
// if average bytes and doc count, are more than threshold then mark as q1
boolean mark = false;
if ( avg_bytes >= q1_ur_threshold && doc_counter >= q1_ur_docCount ) {
mark = true;
for ( int index = 0; index < Cache.size(); index++) {
String Orig_key = Cache.get(index);
float doc_score = log10;
WebPage page = datastore.get(Orig_key);
if ( page == null ) {
continue;
}
page.setScore(doc_score);
if (mark) {
page.getMarkers().put( Queue, Q1);
}
context.write(Orig_key, page);
}
}
}
}
In my testing and debugging, I have found that the statement WebPage page = datastore.get(Orig_key); is the major cause of the long runtime. It took about 16 hours to complete the job, but when I replaced this statement with WebPage page = WebPage.newBuilder().build(); the time was reduced to 6 hours. Is this due to I/O?
I was given an exercise in which I need to refactor several Java projects.
Only these 2 are left, and I truly have no idea how to refactor them.
csv.writer
public class CsvWriter {
public CsvWriter() {
}
public void write(String[][] lines) {
for (int i = 0; i < lines.length; i++)
writeLine(lines[i]);
}
private void writeLine(String[] fields) {
if (fields.length == 0)
System.out.println();
else {
writeField(fields[0]);
for (int i = 1; i < fields.length; i++) {
System.out.print(",");
writeField(fields[i]);
}
System.out.println();
}
}
private void writeField(String field) {
if (field.indexOf(',') != -1 || field.indexOf('\"') != -1)
writeQuoted(field);
else
System.out.print(field);
}
private void writeQuoted(String field) {
System.out.print('\"');
for (int i = 0; i < field.length(); i++) {
char c = field.charAt(i);
if (c == '\"')
System.out.print("\"\"");
else
System.out.print(c);
}
System.out.print('\"');
}
}
csv.writertest
public class CsvWriterTest {
@Test
public void testWriter() {
CsvWriter writer = new CsvWriter();
String[][] lines = new String[][] {
new String[] {},
new String[] { "only one field" },
new String[] { "two", "fields" },
new String[] { "", "contents", "several words included" },
new String[] { ",", "embedded , commas, included",
"trailing comma," },
new String[] { "\"", "embedded \" quotes",
"multiple \"\"\" quotes\"\"" },
new String[] { "mixed commas, and \"quotes\"", "simple field" } };
// Expected:
// -- (empty line)
// only one field
// two,fields
// ,contents,several words included
// ",","embedded , commas, included","trailing comma,"
// """","embedded "" quotes","multiple """""" quotes"""""
// "mixed commas, and ""quotes""",simple field
writer.write(lines);
}
}
test
public class Configuration {
public int interval;
public int duration;
public int departure;
public void load(Properties props) throws ConfigurationException {
String valueString;
int value;
valueString = props.getProperty("interval");
if (valueString == null) {
throw new ConfigurationException("monitor interval");
}
value = Integer.parseInt(valueString);
if (value <= 0) {
throw new ConfigurationException("monitor interval > 0");
}
interval = value;
valueString = props.getProperty("duration");
if (valueString == null) {
throw new ConfigurationException("duration");
}
value = Integer.parseInt(valueString);
if (value <= 0) {
throw new ConfigurationException("duration > 0");
}
if ((value % interval) != 0) {
throw new ConfigurationException("duration % interval");
}
duration = value;
valueString = props.getProperty("departure");
if (valueString == null) {
throw new ConfigurationException("departure offset");
}
value = Integer.parseInt(valueString);
if (value <= 0) {
throw new ConfigurationException("departure > 0");
}
if ((value % interval) != 0) {
throw new ConfigurationException("departure % interval");
}
departure = value;
}
}
public class ConfigurationException extends Exception {
private static final long serialVersionUID = 1L;
public ConfigurationException() {
super();
}
public ConfigurationException(String arg0) {
super(arg0);
}
public ConfigurationException(String arg0, Throwable arg1) {
super(arg0, arg1);
}
public ConfigurationException(Throwable arg0) {
super(arg0);
}
}
configuration.test
public class ConfigurationTest {
@Test
public void testGoodInput() throws IOException {
String data = "interval = 10\nduration = 100\ndeparture = 200\n";
Properties input = loadInput(data);
Configuration props = new Configuration();
try {
props.load(input);
} catch (ConfigurationException e) {
assertTrue(false);
return;
}
assertEquals(props.interval, 10);
assertEquals(props.duration, 100);
assertEquals(props.departure, 200);
}
@Test
public void testNegativeValues() throws IOException {
processBadInput("interval = -10\nduration = 100\ndeparture = 200\n");
processBadInput("interval = 10\nduration = -100\ndeparture = 200\n");
processBadInput("interval = 10\nduration = 100\ndeparture = -200\n");
}
@Test
public void testInvalidDuration() throws IOException {
processBadInput("interval = 10\nduration = 99\ndeparture = 200\n");
}
@Test
public void testInvalidDeparture() throws IOException {
processBadInput("interval = 10\nduration = 100\ndeparture = 199\n");
}
private void processBadInput(String data) throws IOException {
Properties input = loadInput(data);
boolean failed = false;
Configuration props = new Configuration();
try {
props.load(input);
} catch (ConfigurationException e) {
failed = true;
}
assertTrue(failed);
}
private Properties loadInput(String data) throws IOException {
InputStream is = new StringBufferInputStream(data);
Properties input = new Properties();
input.load(is);
is.close();
return input;
}
}
OK, here is some advice regarding the code.
CsvWriter
The bad thing is that you print everything to System.out, which makes the class hard to test without mocks. Instead, I suggest adding a PrintStream field that defines where all the output goes.
import java.io.PrintStream;
public class CsvWriter {
private final PrintStream printStream;
public CsvWriter() {
this.printStream = System.out;
}
public CsvWriter(PrintStream printStream) {
this.printStream = printStream;
}
...
You then write everything to this stream. This refactoring is easy if you use the replace function (Ctrl+R in IDEA). Here is an example of how to do it:
private void writeField(String field) {
if (field.indexOf(',') != -1 || field.indexOf('\"') != -1)
writeQuoted(field);
else
printStream.print(field);
}
The other stuff seems OK in this class.
CsvWriterTest
First things first: you shouldn't check all the logic in a single method. Make small methods for the different kinds of tests. It's OK to keep your current test though; sometimes it's useful to check most of the logic in one complex scenario.
Also pay attention to the names of the test methods.
Obviously, your test doesn't check the results. That's why we need the PrintStream functionality: we can build a PrintStream on top of a ByteArrayOutputStream, construct a String from its contents, and check whether the output is valid. Here is how you can easily check what was written:
public class CsvWriterTest {
private ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
private PrintStream printStream = new PrintStream(byteArrayOutputStream);
@Test
public void testWriter() {
CsvWriter writer = new CsvWriter(printStream);
... old logic here ...
writer.write(lines);
String result = new String(byteArrayOutputStream.toByteArray());
Assert.assertTrue(result.contains("two,fields"));
    }
}
Configuration
Make fields private
Make messages more concise
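For example, here is a minimal sketch of the encapsulated version (the getter names are my own choice); the tests then assert on the getters instead of the public fields:
public class Configuration {
    private int interval;
    private int duration;
    private int departure;
    public int getInterval() { return interval; }
    public int getDuration() { return duration; }
    public int getDeparture() { return departure; }
    // load(Properties) stays as it is, it just assigns the now-private fields
}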
ConfigurationException
The serialVersionUID looks fine; it's needed for serialization/deserialization.
ConfigurationTest
Do not use assertTrue(false) or assertTrue(failed); use Assert.fail(String) with an understandable message.
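For example, processBadInput can fail with a clear message when no exception is thrown (a sketch using JUnit 4's Assert):
private void processBadInput(String data) throws IOException {
    Properties input = loadInput(data);
    Configuration props = new Configuration();
    try {
        props.load(input);
        Assert.fail("Expected ConfigurationException for input: " + data);
    } catch (ConfigurationException e) {
        // expected
    }
}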
Tip: if you don't have much experience and need to refactor code like this, you may want to read some chapters of Effective Java (2nd edition) by Joshua Bloch. The book is not that big, so you can read it in a week, and it lays out rules for writing clean and understandable code.
I have implemented two member functions in the same class:
private static void getRequiredTag(Context context) throws IOException
{
//repeated begin
for (Record record : context.getContext().readCacheTable("subscribe")) {
String traceId = record.get("trace_id").toString();
if (traceSet.contains(traceId) == false)
continue;
String tagId = record.get("tag_id").toString();
try {
Integer.parseInt(tagId);
} catch (NumberFormatException e) {
context.getCounter("Error", "tag_id not a number").increment(1);
continue;
}
//repeated end
tagSet.add(tagId);
}
}
private static void addTagToTraceId(Context context) throws IOException
{
//repeated begin
for (Record record : context.getContext().readCacheTable("subscribe")) {
String traceId = record.get("trace_id").toString();
if (traceSet.contains(traceId) == false)
continue;
String tagId = record.get("tag_id").toString();
try {
Integer.parseInt(tagId);
} catch (NumberFormatException e) {
context.getCounter("Error", "tag_id not a number").increment(1);
continue;
}
//repeated end
Vector<String> ret = traceListMap.get(tagId);
if (ret == null) {
ret = new Vector<String>();
}
ret.add(traceId);
traceListMap.put(tagId, ret);
}
}
I call these two member functions from another two member functions (so I can't merge them into one):
private static void A()
{
getRequiredTag()
}
private static void B()
{
getRequiredTag()
addTagToTraceId()
}
tagSet is a java.util.Set and traceListMap is a java.util.Map.
I know the DRY principle and I really want to eliminate the repeated code, so I came up with this:
private static void getTraceIdAndTagIdFromRecord(Record record, String traceId, String tagId) throws IOException
{
traceId = record.get("trace_id").toString();
tagId = record.get("tag_id").toString();
}
private static boolean checkTagIdIsNumber(String tagId)
{
try {
Integer.parseInt(tagId);
} catch (NumberFormatException e) {
return false;
}
return true;
}
private static void getRequiredTag(Context context) throws IOException
{
String traceId = null, tagId = null;
for (Record record : context.getContext().readCacheTable("subscribe")) {
getTraceIdAndTagIdFromRecord(record, traceId, tagId);
if (traceSet.contains(traceId) == false)
continue;
if (!checkTagIdIsNumber(tagId))
{
context.getCounter("Error", "tag_id not a number").increment(1);
continue;
}
tagSet.add(tagId);
}
}
private static void addTagToTraceId(Context context) throws IOException
{
String traceId = null, tagId = null;
for (Record record : context.getContext().readCacheTable("subscribe")) {
getTraceIdAndTagIdFromRecord(record, traceId, tagId);
if (traceSet.contains(traceId) == false)
continue;
if (!checkTagIdIsNumber(tagId))
{
context.getCounter("Error", "tag_id not a number").increment(1);
continue;
}
Vector<String> ret = traceListMap.get(tagId);
if (ret == null) {
ret = new Vector<String>();
}
ret.add(traceId);
traceListMap.put(tagId, ret);
}
}
It seems I got a new repeat... I have no idea how to eliminate the repetition in this case; could anybody give me some advice?
update 2015-5-13 21:15:12:
Some people suggest a boolean argument to eliminate the repetition, but I know
Robert C. Martin's Clean Code Tip #12: Eliminate Boolean Arguments (you can google it for more details).
Could you give some comments on that?
The parts that change require the values of tagId and traceId, so we will start by extracting an interface that takes those parameters:
public interface PerformingInterface {
void accept(String tagId, String traceId);
}
Then extract the common parts into this method:
private static void doSomething(Context context, PerformingInterface perform) throws IOException
{
for (Record record : context.getContext().readCacheTable("subscribe")) {
    // read the ids directly from the record; assigning to the parameters of
    // getTraceIdAndTagIdFromRecord would not propagate back to the caller
    String traceId = record.get("trace_id").toString();
    String tagId = record.get("tag_id").toString();
if (traceSet.contains(traceId) == false)
continue;
if (!checkTagIdIsNumber(tagId))
{
context.getCounter("Error", "tag_id not a number").increment(1);
continue;
}
perform.accept(tagId, traceId);
}
}
Then call this method in two different ways:
private static void getRequiredTag(Context context) throws IOException {
doSomething(context, new PerformingInterface() {
@Override public void accept(String tagId, String traceId) {
tagSet.add(tagId);
}
});
}
private static void addTagToTraceId(Context context) throws IOException {
doSomething(context, new PerformingInterface() {
@Override public void accept(String tagId, String traceId) {
Vector<String> ret = traceListMap.get(tagId);
if (ret == null) {
ret = new Vector<String>();
}
ret.add(traceId);
traceListMap.put(tagId, ret);
}
});
}
Note that with Java 8 you could replace the anonymous classes above with lambdas, and PerformingInterface with the built-in BiConsumer functional interface (a sketch follows the list below); the more verbose code shown above is what it takes on Java 7 and below.
Some other issues with your code:
Way too many things are static
The Vector class is old; it is recommended to use ArrayList instead (if you need synchronization, wrap it in Collections.synchronizedList)
Always use braces, even for one-liners
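To illustrate the last two points, here is an untested Java 8 sketch that replaces PerformingInterface with java.util.function.BiConsumer and Vector with ArrayList. It assumes traceListMap is declared as Map<String, List<String>>, reads both columns straight from the Record, and the helper name forEachValidRecord is mine:
private static void forEachValidRecord(Context context, BiConsumer<String, String> perform) throws IOException {
    for (Record record : context.getContext().readCacheTable("subscribe")) {
        String traceId = record.get("trace_id").toString();
        if (!traceSet.contains(traceId)) {
            continue;
        }
        String tagId = record.get("tag_id").toString();
        if (!checkTagIdIsNumber(tagId)) {
            context.getCounter("Error", "tag_id not a number").increment(1);
            continue;
        }
        perform.accept(tagId, traceId);
    }
}
private static void getRequiredTag(Context context) throws IOException {
    forEachValidRecord(context, (tagId, traceId) -> tagSet.add(tagId));
}
private static void addTagToTraceId(Context context) throws IOException {
    forEachValidRecord(context, (tagId, traceId) ->
            traceListMap.computeIfAbsent(tagId, k -> new ArrayList<>()).add(traceId));
}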
You could use a stream (haven't tested):
private static Stream<Record> validRecords(Context context) throws IOException {
return context.getContext().readCacheTable("subscribe").stream()
.filter(r -> {
if (!traceSet.contains(traceId(r))) {
return false;
}
try {
Integer.parseInt(tagId(r));
return true;
} catch (NumberFormatException e) {
context.getCounter("Error", "tag_id not a number").increment(1);
return false;
}
});
}
private static String traceId(Record record) {
return record.get("trace_id").toString();
}
private static String tagId(Record record) {
return record.get("tag_id").toString();
}
Then you could do just:
private static void getRequiredTag(Context context) throws IOException {
validRecords(context).map(r -> tagId(r)).forEach(tagSet::add);
}
private static void addTagToTraceId(Context context) throws IOException {
validRecords(context).forEach(r -> {
String tagId = tagId(r);
Vector<String> ret = traceListMap.get(tagId);
if (ret == null) {
ret = new Vector<String>();
}
ret.add(traceId(r));
traceListMap.put(tagId, ret);
});
}
tagId will always be null in your second attempt: Java passes references by value, so assigning to the parameters inside getTraceIdAndTagIdFromRecord does not change the caller's variables.
Nevertheless, one approach would be to extract the code that collects tagIds (this seems to be the same in both methods) into its own method. Then, in each of the two methods, just iterate over the returned collection of tagIds and perform the method-specific operations on them.
for (String tagId : getTagIds(context)) {
// do method specific logic
}
EDIT
Now I noticed that you also use traceId in the second method. The principle remains the same: just collect the Records in a separate method and iterate over them in the two methods (taking tagId and traceId from each record).
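A sketch of that approach (the method name getValidRecords is mine; note it materializes the matching records into a list, which is fine as long as the cache table is not huge):
private static List<Record> getValidRecords(Context context) throws IOException {
    List<Record> valid = new ArrayList<Record>();
    for (Record record : context.getContext().readCacheTable("subscribe")) {
        if (!traceSet.contains(record.get("trace_id").toString())) {
            continue;
        }
        if (!checkTagIdIsNumber(record.get("tag_id").toString())) {
            context.getCounter("Error", "tag_id not a number").increment(1);
            continue;
        }
        valid.add(record);
    }
    return valid;
}
// in each of the two methods:
for (Record record : getValidRecords(context)) {
    String tagId = record.get("tag_id").toString();
    String traceId = record.get("trace_id").toString();
    // method-specific logic
}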
The solution with lambdas is the most elegant one, but without them it involves creating a separate interface and two anonymous classes, which is too verbose for this use case (honestly, here I would rather go with a boolean argument than with a strategy pattern without lambdas).
Try this approach:
private static void imYourNewMethod(Context context, boolean isAddTag) throws IOException {
    for (Record record : context.getContext().readCacheTable("subscribe")) {
        String traceId = record.get("trace_id").toString();
        String tagId = record.get("tag_id").toString();
if (traceSet.contains(traceId) == false)
continue;
if (!checkTagIdIsNumber(tagId))
{
context.getCounter("Error", "tag_id not a number").increment(1);
continue;
}
if(isAddTag){
Vector<String> ret = traceListMap.get(tagId);
if (ret == null) {
ret = new Vector<String>();
}
ret.add(traceId);
traceListMap.put(tagId, ret);
}else{
tagSet.add(tagId);
}
    }
}
Call this method and pass one more boolean parameter: true if you want to add the tag-to-trace mapping, or false if you just want to collect the tags.
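For example, the two existing methods then shrink to simple delegations:
private static void getRequiredTag(Context context) throws IOException {
    imYourNewMethod(context, false); // just collect the tag ids
}
private static void addTagToTraceId(Context context) throws IOException {
    imYourNewMethod(context, true); // also map tag id -> trace ids
}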
I have written a program to cache objects using Google Guava. My problem is how to pass additional parameters to Guava's load method. Here is the code. As you can see in this line, I have made fileId plus pageNo the key: cache.get(fileID+pageNo);. Now, when cache.get is called and that key is not in the cache, Guava calls the load method of the PreviewCacheLoader class, which I have given below as well.
public class PreviewCache {
static final LoadingCache<String, CoreObject> cache = CacheBuilder.newBuilder()
.maximumSize(1000)
.expireAfterWrite(5, TimeUnit.MINUTES)
.build(new PreviewCacheLoader());
public CoreObject getPreview(String strTempPath, int pageNo, int requiredHeight, String fileID, String strFileExt, String ssoId) throws IOException
{
CoreObject coreObject = null;
try {
coreObject = cache.get(fileID+pageNo, HOW TO PASS pageNO and requiredHeight because I want to keep key as ONLY fileID+pageNo );
} catch (ExecutionException e) {
e.printStackTrace();
}
return coreObject;
}
}
How can I pass the additional int and String parameters from above to the load method below, in addition to the key parameter?
public class PreviewCacheLoader extends CacheLoader<String, CoreObject> {
@Override
public CoreObject load(String fileIDpageNo, HOW TO GET pageNO and requiredHeight) throws Exception {
CoreObject coreObject = new CoreObject();
// MAKE USE OF PARAMETERS pageNO and requiredHeight
// Populate coreObject here
return coreObject;
}
}
For starters, it's extremely bad programming practice to use fileId + pageNo as a String key instead of creating a proper object. (This is called "stringly typed" code.) The best way to solve your problem would probably look like:
class FileIdAndPageNo {
private final String fileId;
private final int pageNo;
...constructor, hashCode, equals...
}
public CoreObject getPreview(final int pageNo, final int requiredHeight, final String fileID) throws IOException
{
CoreObject coreObject = null;
try {
coreObject = cache.get(new FileIdAndPageNo(fileID, pageNo),
new Callable<CoreObject>() {
public CoreObject call() throws Exception {
// you have access to pageNo and requiredHeight here
}
});
} catch (ExecutionException e) {
e.printStackTrace();
}
return coreObject;
}
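For completeness, here is a sketch of what the key class could look like (equals/hashCode via java.util.Objects). The cache itself would then be declared with FileIdAndPageNo as its key type, e.g. Cache<FileIdAndPageNo, CoreObject>, since the value is produced by the Callable rather than by a CacheLoader:
final class FileIdAndPageNo {
    private final String fileId;
    private final int pageNo;
    FileIdAndPageNo(String fileId, int pageNo) {
        this.fileId = fileId;
        this.pageNo = pageNo;
    }
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof FileIdAndPageNo)) {
            return false;
        }
        FileIdAndPageNo other = (FileIdAndPageNo) o;
        return pageNo == other.pageNo && fileId.equals(other.fileId);
    }
    @Override
    public int hashCode() {
        return Objects.hash(fileId, pageNo);
    }
}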
I would like to find the Lucene analyzer corresponding to the language of a Java locale.
For instance, Locale.ENGLISH would be mapped to org.apache.lucene.analysis.en.EnglishAnalyzer.
Is there an automated mapping somewhere?
This is not available out of the box. Below is the way I do it.
public final class LocaleAwareAnalyzer extends AnalyzerWrapper {
private static final Logger LOG = LoggerFactory.getLogger(LocaleAwareAnalyzer.class);
private final Analyzer defaultAnalyzer;
private final Map<String, Analyzer> perLocaleAnalyzer = perLocaleAnalyzers();
public LocaleAwareAnalyzer(final Analyzer defaultAnalyzer) {
this.defaultAnalyzer = Precondition.notNull("defaultAnalyzer", defaultAnalyzer);
}
@Override
protected Analyzer getWrappedAnalyzer(final String fieldName) {
if (fieldName == null) {
return defaultAnalyzer;
}
final int n = fieldName.indexOf('_');
if (n >= 0) {
// Unfortunately CharArrayMap does not offer get(CharSequence, start, end)
final String locale = fieldName.substring(n + 1);
final Analyzer a = perLocaleAnalyzer.get(locale);
if (a != null) {
return a;
}
LOG.warn("No Analyzer for Locale '%s', using default", locale);
}
return defaultAnalyzer;
}
@Override
protected TokenStreamComponents wrapComponents(final String fieldName,
final TokenStreamComponents components) {
return components;
}
private static Map<String, Analyzer> perLocaleAnalyzers() {
final Map<String, Analyzer> m = new HashMap<>();
m.put("en", new EnglishAnalyzer(Version.LUCENE_43));
m.put("es", new SpanishAnalyzer(Version.LUCENE_43));
m.put("de", new GermanAnalyzer(Version.LUCENE_43));
m.put("fr", new FrenchAnalyzer(Version.LUCENE_43));
// ... etc
return m;
}
}
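Usage then comes down to wrapping your default analyzer and suffixing indexed field names with the locale's language code, which is what the fieldName.indexOf('_') lookup above expects. A rough sketch against the Lucene 4.3 API (the body_en naming convention and the RAMDirectory are just for the example):
Analyzer analyzer = new LocaleAwareAnalyzer(new StandardAnalyzer(Version.LUCENE_43));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
try (IndexWriter writer = new IndexWriter(new RAMDirectory(), config)) {
    Document doc = new Document();
    // "body_" + Locale.ENGLISH.getLanguage() yields "body_en", so getWrappedAnalyzer
    // picks the EnglishAnalyzer registered under "en"
    doc.add(new TextField("body_" + Locale.ENGLISH.getLanguage(), "some English text", Field.Store.NO));
    writer.addDocument(doc);
}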