Jackson JSON parser invalid utf-8 start byte

I'm trying to parse the following JSON and I keep getting a JsonParseException:
{
"episodes":{
"description":"Episode 3 – Oprah's Surprise Patrol from 1\/20\/04\nTake a trip down memory lane and hear all your favorite episodes of The Oprah Winfrey Show from the last 25 seasons -- everyday on your radio!"
}
}
It also fails on this JSON:
{
"episodes":{
"description":"After 20 years in sports talk…he’s still the top dog! Catch Christopher “Mad Dog” Russo weekday afternoons on Mad Dog Radio as he tells it like it is…Give the Doggie a call at 888-623-3646."
}
}
Exception:
org.codehaus.jackson.JsonParseException: Invalid UTF-8 start byte 0x96
at [Source: C:\Json Test Files\episodes.txt; line: 3, column: 33]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1291)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
at org.codehaus.jackson.impl.Utf8StreamParser._reportInvalidInitial(Utf8StreamParser.java:2236)
at org.codehaus.jackson.impl.Utf8StreamParser._reportInvalidChar(Utf8StreamParser.java:2230)
at org.codehaus.jackson.impl.Utf8StreamParser._finishString2(Utf8StreamParser.java:1467)
at org.codehaus.jackson.impl.Utf8StreamParser._finishString(Utf8StreamParser.java:1394)
at org.codehaus.jackson.impl.Utf8StreamParser.getText(Utf8StreamParser.java:113)
at com.niveus.jackson.Main.parseEpisodes(Main.java:37)
at com.niveus.jackson.Main.main(Main.java:13)
Code:
public static void main(String [] args) {
parseEpisodes("C:\\Json Test Files\\episodes.txt");
}
public static void parseEpisodes(String filename) {
JsonFactory factory = new JsonFactory();
JsonParser parser = null;
String nameField = null;
try {
parser = factory.createJsonParser(new File(filename));
parser.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, true);
parser.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true);
JsonToken token = parser.nextToken();
nameField = parser.getText();
String desc = null;
while (token != JsonToken.END_OBJECT) {
if (nameField.equals("episodes")) {
while (token != JsonToken.END_OBJECT) {
if (nameField.equals("description")) {
parser.nextToken();
desc = parser.getText();
}
token = parser.nextToken();
nameField = parser.getText();
}
}
token = parser.nextToken();
nameField = parser.getText();
}
System.out.println(desc);
} catch (JsonParseException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}

The character at column 33 is – (an en dash). It shows up as the byte 0x96 because the file is physically encoded as Windows-1252. You need to save the file as UTF-8; Windows-1252 is not a valid encoding for JSON. How to do this depends on the text editor you are using.
See JSON RFC:
Encoding
JSON text SHALL be encoded in Unicode. The default encoding is
UTF-8.
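If re-saving the file in an editor is not an option, the same conversion can be sketched in plain Java. This is a minimal sketch, assuming the input really is Windows-1252; the class name `Cp1252ToUtf8` and the method `transcode` are made up for illustration and are not part of Jackson.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252ToUtf8 {
    // Decode the bytes as Windows-1252, then re-encode the text as UTF-8.
    static byte[] transcode(byte[] cp1252Bytes) {
        String text = new String(cp1252Bytes, Charset.forName("windows-1252"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x96 is the en dash in Windows-1252 but a stray continuation byte in UTF-8.
        byte[] original = {'3', ' ', (byte) 0x96, ' ', 'O'};
        byte[] utf8 = transcode(original);
        // The single byte 0x96 becomes the three-byte UTF-8 sequence E2 80 93.
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

An alternative with the same effect is to wrap the file in an InputStreamReader that names the real charset and hand that reader to the parser, so the parser never sees the raw bytes.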

I faced a similar issue. Open your JSON file in Notepad++, select UTF-8 in the Encoding drop-down, and save the text to another file. Doing this resolved the issue.

I tried everything mentioned here and nothing solved my issue, so I retyped the payload by hand, and that fixed it.

I know this question is old, but I would like to share something that works for me. It is possible to ignore the offending character in the following way.
Define a charset decoder:
CharsetDecoder CHARSET_DECODER = StandardCharsets.UTF_8.newDecoder().onMalformedInput(CodingErrorAction.IGNORE);
Use it to read the InputStream:
InputStreamReader stream = new InputStreamReader(resource.getInputStream(), CHARSET_DECODER);
Use the Jackson CSV mapper to read the content:
new CsvMapper().readerFor(Map.class).readValues(stream);
The key element here is the charset decoder with the IGNORE option for malformed input.
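The effect of the IGNORE option can be seen without Jackson at all. The following is a minimal sketch using only the JDK; the class name `LenientUtf8` and the helper `readIgnoringMalformed` are hypothetical. Note that the malformed bytes are silently dropped, so the en dash from the original question would simply disappear rather than be repaired.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientUtf8 {
    // Reads the bytes as UTF-8 and skips any byte sequence that is not valid UTF-8.
    static String readIgnoringMalformed(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decoder)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x96 on its own is not a valid UTF-8 start byte, so it is dropped.
        byte[] input = {'a', (byte) 0x96, 'b'};
        System.out.println(readIgnoringMalformed(input)); // prints "ab"
    }
}
```

If losing characters is not acceptable, CodingErrorAction.REPLACE substitutes a replacement character instead, and decoding with the file's real charset avoids the loss altogether.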

Related

ObjectOutputStream.writeUTF writes corrupt characters at the start

this is my .json file:
{"Usuarios":[{"password":"admin","apellido":"Admin","correo":"Adminadmin.com","direccion":"Admin","telefono":"Admin","nombre":"Admin","username":"admin"}]}
(I tried to translate my code from Spanish to English in the comments as best I could <3)
The function that writes in the JSON is this one:
public void agregarUsuario(String nombre, String apellido, String direccion, String telefono, String correo, String username, String password) {
try {
//String jsonString = JsonObject.toString();
JSONObject usuarios = getJSONObjectFromFile("/usuarios.json");
JSONArray listaUsuario = usuarios.getJSONArray("Usuarios");
JSONObject newObject = new JSONObject();
newObject.put("nombre", nombre);
newObject.put("apellido", apellido);
newObject.put("direccion", direccion);
newObject.put("telefono", telefono);
newObject.put("correo", correo);
newObject.put("username",username);
newObject.put("password", password);
listaUsuario.put(newObject);
usuarios.put("Usuarios",listaUsuario);
ObjectOutputStream outputStream = null;
outputStream = new ObjectOutputStream(new FileOutputStream("C:\\Users\\Victor\\eclipse-workspace\\Iplane\\assets\\usuarios.json"));
outputStream.writeUTF(usuarios.toString());
outputStream.flush();
outputStream.close();
}catch(JSONException e) {
e.printStackTrace();
}catch(Exception e) {
System.err.println("Error writing json: " + e);
}
}
So, if I create a new user in my "create user" JFrame window with "asdf" in every field, I should get the following JSON file:
{"Usuarios":[{"password":"admin","apellido":"Admin","correo":"Adminadmin.com","direccion":"Admin","telefono":"Admin","nombre":"Admin","username":"admin"},{"password":"asdf","apellido":"asdf","correo":"asdf","direccion":"asdf","telefono":"asdf","nombre":"asdf","username":"asdf"}]}
And yes, that happens! But I also get some weird ASCII/Unicode symbols in front of my main JSON object. I can't copy the output here, so this is my output on imgur: link.
Why does this happen, and how can I fix it?
If someone needs my JSON file reader (maybe the problem is there), here you go:
public static InputStream inputStreamFromFile(String path) {
try {
InputStream inputStream = FileHandle.class.getResourceAsStream(path); //charge json in "InputStream"
return inputStream;
}catch(Exception e) {
e.printStackTrace(); //tracer for json exceptions
}
return null;
}
public static String getJsonStringFromFile(String path) {
Scanner scanner;
InputStream in = inputStreamFromFile(path); //obtains the content of the .JSON and saves it in: "in" variable
scanner = new Scanner(in); //new scanner with inputStream "in" info
String json= scanner.useDelimiter("\\Z").next(); //reads .JSON and saves it in string "json"
scanner.close(); //close the scanner
return json; //return json String
}
public static boolean objectExists (JSONObject jsonObject, String key) { //verifies whether an object exist in the json
Object o;
try {
o=jsonObject.get(key);
}catch(Exception e) {
return false;
}
return o!=null;
}
public static JSONObject getJSONObjectFromFile(String path) { //creates a jsonObject from a path
return new JSONObject(getJsonStringFromFile(path));
}
So, after writing to the JSON file, I can't do anything with it, because these weird symbols cause errors in my JSON: "extraneus input: (here are the symbols) expecting [STRING, NUMBER, TRUE, FALSE, {..."

writeUTF does not write plain Unicode text; it prepends two bytes of length information to the output.
If you use writeUTF intentionally, you have to use readUTF to read the data back. Otherwise I would suggest using an OutputStreamWriter.
writeUTF()
Writes two bytes of length information to the output stream, followed
by the modified UTF-8 representation of every character in the string
s. If s is null, a NullPointerException is thrown. Each character in
the string s is converted to a group of one, two, or three bytes,
depending on the value of the character.
Edit to clarify OutputStreamWriter:
To use OutputStreamWriter, just replace the ObjectOutputStream with an OutputStreamWriter and call write instead of writeUTF.
You might find this small tutorial helpful: Java IO: OutputStreamWriter on jenkov.com
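The difference between the two approaches can be sketched with in-memory streams. This is only an illustration: the class name `WriteJsonDemo` and both helper methods are hypothetical, and a ByteArrayOutputStream stands in for the FileOutputStream used in the question. Besides the two length bytes from writeUTF, ObjectOutputStream also writes its own serialization stream header, which starts with the bytes 0xAC 0xED.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteJsonDemo {
    // Writes the text as plain UTF-8: no header and no length prefix.
    static byte[] writeWithWriter(String json) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(json);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Writes the text the way the question does, for comparison only.
    static byte[] writeWithObjectStream(String json) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeUTF(json);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String json = "{\"Usuarios\":[]}";
        byte[] plain = writeWithWriter(json);
        byte[] framed = writeWithObjectStream(json);
        // The plain output is exactly the JSON text; the framed output is longer.
        System.out.println("OutputStreamWriter bytes: " + plain.length);
        System.out.println("ObjectOutputStream bytes: " + framed.length);
        System.out.printf("First framed byte: 0x%02X%n", framed[0] & 0xFF);
    }
}
```

Those extra leading bytes are what a text editor or JSON parser shows as unexpected symbols in front of the opening brace.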

Java - Printing unicode from text file doesn't output corresponding UTF-8 character

I have a text file containing numerous Unicode escape sequences, and I am trying to print the corresponding characters to the console, but all it prints is the escape string itself. If I copy any of the values and paste them into a System.out call, it works fine, but not when reading them from the text file.
The following is my code for reading the file, which contains lines of values like \u00C0, \u00C1, \u00C2, \u00C3. These are printed to the console as-is instead of the characters I want.
private void printFileContents() throws IOException {
Path encoding = Paths.get("unicode.txt");
try (Stream<String> stream = Files.lines(encoding)) {
stream.forEach(v -> { System.out.println(v); });
} catch (IOException e) {
e.printStackTrace();
}
}
This is the method I used to parse the HTML that contained the Unicode values in the first place.
private void parseGermanEncoding() {
try
{
File encoding = new File("encoding.html");
Document document = Jsoup.parse(encoding, "UTF-8", "http://example.com/");
Element table = document.getElementsByClass("codetable").first();
Path f = Paths.get("unicode.txt");
try (BufferedWriter wr = new BufferedWriter(new FileWriter(f.toFile())))
{
for (Element row : table.select("tr"))
{
Elements tds = row.select("td");
String unicode = tds.get(0).text();
if (unicode.startsWith("U+"))
{
unicode = unicode.substring(2);
}
wr.write("\\u" + unicode);
wr.newLine();
}
wr.flush();
wr.close();
}
} catch (IOException e)
{
e.printStackTrace();
}
}

You will need to convert the string from a Unicode-encoded string to a UTF-8 encoded string. You could follow these steps: 1. convert the string to a byte array using myString.getBytes("UTF-8"), and 2. get the UTF-8 encoded string using new String(byteArray, "UTF-8"). The code block needs to be surrounded with try/catch for UnsupportedEncodingException.

Thanks to OTM's comment above I was able to get a working solution for this. You take the Unicode escape, strip the leading \u, parse the remaining hex digits with Integer.parseInt(), and finally cast the result to char to get the actual character. This solution is based on the post provided by OTM: How to convert a string with Unicode encoding to a string of letters
private void printFileContents() throws IOException {
Path encoding = Paths.get("unicode.txt");
try (Stream<String> stream = Files.lines(encoding)) {
stream.forEach(v ->
{
String output = "";
// Drop the leading "\u" so that only the hex digits remain
String hex = v.startsWith("\\u") ? v.substring(2) : v;
// Parse the hex digits into the numeric character value
int parse = Integer.parseInt(hex.trim(), 16);
// Cast the numeric value to char to get the actual character
output += (char) parse;
System.out.println(output);
});
} catch (IOException e) {
e.printStackTrace();
}
}

Java XML Parsing - incorrect string version of the data with VTD-XML

I am parsing an XML document in UTF-8 encoding with Java using VTD-XML.
A small excerpt looks like:
<literal>𠀋</literal>
<literal>𠂉</literal>
<literal>𠂢</literal>
I want to iterate through each literal and print it out to the console. However, what I get is:
¢
I am correctly navigating to each element. The way that I get the text value is by calling:
private static String toNormalizedString(String name, int val, final VTDNav vn) throws NavException {
String strValue = null;
if (val != -1) {
strValue = vn.toNormalizedString(val);
}
return strValue;
}
I've also tried vn.getXPathStringVal();, however it yields the same results.
I know that each of the literals above isn't just a string of length one. Rather, they seem to be Unicode "characters" composed of two chars. I am able to correctly parse and output the kanji characters if their length is just one.
My question is - how can I correctly parse and output these characters using VTD-XML? Is there a way to get the underlying bytes of the text between the literal tags so that I can parse the bytes myself?
EDIT
Code to process each line of the XML - converting it to a byte array and then back to a String.
try (BufferedReader br = new BufferedReader(new FileReader("res/sample.xml"))) {
String line;
while ((line = br.readLine()) != null) {
byte[] myBytes = null;
try {
myBytes = line.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
System.exit(-1);
}
System.out.println(new String(myBytes));
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}

You are probably trying to get a string containing characters at or above 0x10000, that is, outside the Basic Multilingual Plane. That bug is known and is in the process of being addressed. I will notify you once the fix is out.
This question may be identical to this one:
Map supplementary Unicode characters to BMP (if possible)
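Why such characters look like "two characters" in Java can be sketched with the JDK alone. The example below uses U+2000B as a sample supplementary code point; the class name `SupplementaryDemo` and the helper `fromCodePoint` are hypothetical, and nothing here touches VTD-XML itself.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryDemo {
    // Builds a String from a single code point, using a surrogate pair when needed.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x2000B);
        System.out.println(s.length());                                // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));           // 1 (actual character)
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 (UTF-8 bytes)
        // Iterate by code point rather than by char to avoid splitting the pair.
        s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp)); // U+2000B
    }
}
```

Any code that processes such text one char at a time, or decodes its UTF-8 bytes as if every character took at most three bytes, will split or corrupt the pair.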

How do I convert List<String[]> values from UTF-8 to String?

I want to convert some Greek text from UTF-8 to String, because Java does not recognize it. Then I want to populate a JTable with it, so I use a List to help me out. Below is the code snippet:
String[][] rowData;
List<String[]> myEntries;
//...
try {
this.fileReader = new FileReader("D:\\Book1.csv");
this.reader = new CSVReader(fileReader, ';');
myEntries = reader.readAll();
//here I want to convert every value from UTF-8 to String
convertFromUTF8(myEntries); //???
this.rowData = myEntries.toArray(new String[0][]);
} catch (FileNotFoundException ex) {
Logger.getLogger(VJTable.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException ex) {
Logger.getLogger(VJTable.class.getName()).log(Level.SEVERE, null, ex);
}
//...
I created a method
public String convertFromUTF8(List<String[]> s) {
String out = null;
try {
for(String stringValues : s){
out = new String(s.getBytes("ISO-8859-1"), "UTF-8");
}
} catch (java.io.UnsupportedEncodingException e) {
return null;
}
return out;
}
but I cannot continue, because there is no getBytes() method for List.
What should I do? Any idea would be very helpful. Thank you in advance.

The problem is your use of FileReader, which only supports the "default" character set:
this.fileReader = new FileReader("D:\\Book1.csv");
The javadoc for FileReader is very clear on this:
The constructors of this class assume that the default character
encoding and the default byte-buffer size are appropriate. To specify
these values yourself, construct an InputStreamReader on a
FileInputStream.
The appropriate way to get a Reader with a character set specified is as follows:
this.fileStream = new FileInputStream("D:\\Book1.csv");
this.fileReader = new InputStreamReader(fileStream, "utf-8");
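A self-contained version of the same idea, as a sketch under stated assumptions: the class name `Utf8CsvRead` and both methods are hypothetical, a temporary file stands in for D:\Book1.csv, and a naive split on ';' replaces the CSVReader from the question (it does not handle quoted fields). Files.newBufferedReader with an explicit charset is equivalent to wrapping a FileInputStream in an InputStreamReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class Utf8CsvRead {
    // Reads every line as UTF-8, regardless of the platform default charset.
    static List<String[]> readRows(Path csv) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(";", -1)); // naive split, no quoting support
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    // Writes the content to a temporary UTF-8 file and reads it back.
    static List<String[]> roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("book", ".csv");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return readRows(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Greek letters written as escapes so the source file encoding does not matter.
        List<String[]> rows = roundTrip("\u03b1\u03b2\u03b3;\u03b4\u03b5\n1;2\n");
        System.out.println(rows.size() + " rows, first row has " + rows.get(0).length + " columns");
    }
}
```

Because the text is decoded correctly while reading, the resulting String values need no further conversion before going into the JTable.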

To decode UTF-8 bytes to a Java String, you can do something like this (taken from this):
Charset UTF8_CHARSET = Charset.forName("UTF-8");
String decodeUTF8(byte[] bytes) {
return new String(bytes, UTF8_CHARSET);
}
Once you've read the data into a String, you no longer have control over the encoding. Java stores Strings as UTF-16 internally. If the CSV file you're reading was written using UTF-8 encoding, you should read it as UTF-8 into a byte array and then decode that byte array into a Java String using the method above. Once you have the complete String, you can split it into a list of Strings based on the delimiter or other parameters (I don't know anything about your data).

Parse text from webpage in Java (not html)

I am using this code to download a string from a website:
static public String getLast() throws IOException {
String result = "";
URL url = new URL("https://www.bitstamp.net/api/ticker/");
BufferedReader in = new BufferedReader(new InputStreamReader(
url.openStream()));
String str;
while ((str = in.readLine()) != null) {
result += str;
}
in.close();
return result;
}
When I print the result of this method, this is what I get:
{"high": "349.90", "last": "335.23", "timestamp": "1384198415", "bid": "335.00", "volume": "33743.67611671", "low": "300.28", "ask": "335.23"}
That's exactly what is shown when you open the URL. This works fine for me, but if there is a more efficient way to do this please let me know.
What I need to extract is 335.23. This number is constantly changing, but the words such as "high", "last", "timestamp", etc always stay the same. I need to extract the 335.23 as a double. Is this possible?
Edit:
SOLVED
String url = "https://www.bitstamp.net/api/ticker/";
try {
JsonFactory factory = new JsonFactory();
JsonParser jParser = factory.createParser(new URL(url));
while (jParser.nextToken() != JsonToken.END_OBJECT) {
String fieldname = jParser.getCurrentName();
if ("last".equals(fieldname)) {
jParser.nextToken();
System.out.println(jParser.getText());
break;
}
}
jParser.close();
} catch (JsonParseException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}

This is JSON. Use a good parser like Jackson. There are also good tutorials available.

The response is JSON. Use a Java JSON parser and get the value of the "last" element.
One Java JSON parser is available at http://www.json.org/java/index.html
JSONObject obj = new JSONObject(" .... ");
String last = obj.getString("last");
double value = Double.parseDouble(last); // the values in the response are quoted strings

The String you have received is JSON-encoded. JSON (JavaScript Object Notation) is a lightweight data-interchange format. Use a simple JSON encoder and decoder to encode and decode the data.
