How to find incorrect records in a file loaded into Hive

Question

Suppose a single JSON file contains 1000 records and all of them are loaded into a Hive table. If one of those records is incorrect, how can I find it?

score 0 · Answer 1 · Jul 25, 2019

A value with the wrong data type causes the generated MapReduce job to crash, and the ignore.malformed.json SerDe property does not seem to fix it.

Here is the sample data, mixed2.json:

{"f1":"hello", "f2":7}
{"f1":"goodbye", "f2":8}
{"f1":"this", "f2":9}
{"f1":"that", "f2":"ten"}

Here is the sample Hive script, mixed2.hive. The first query (on f1) works. The other two queries (on f2 and on *) crash, where it would be nicer to get NULL for the bad value. By contrast, the get_json_object() function returns the bad string as-is, so it prints "ten".

drop table mixed2;

create table mixed2 (f1 string, f2 int)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties ("ignore.malformed.json" = "true")
stored as textfile;

load data inpath '/tmp/mixed2.json' overwrite into table mixed2;

select f1 from mixed2;
select f2 from mixed2;
select * from mixed2;

You should then declare the column as string instead of int. The SerDe can read the numbers into strings, and you can CAST them in Hive. A string that is not a valid number casts to NULL, which is what identifies the bad record.
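A minimal sketch of that suggestion, assuming the same mixed2.json file and the same openx JsonSerDe as in the script above. The table name mixed2_str is made up for illustration, and the query has not been run against a real cluster:

```sql
-- Hypothetical table: same data as mixed2, but f2 is declared as string
-- so the SerDe does not fail on a non-numeric value.
drop table if exists mixed2_str;

create table mixed2_str (f1 string, f2 string)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as textfile;

-- Note: "load data inpath" moves the file, so it must be present at this
-- path again if an earlier load already consumed it.
load data inpath '/tmp/mixed2.json' overwrite into table mixed2_str;

-- CAST returns NULL when the string is not a valid int, so rows where f2
-- is present but the cast is NULL are the incorrect ones.
select f1, f2
from mixed2_str
where f2 is not null
  and cast(f2 as int) is null;
```

With the sample data, this would be expected to return only the row whose f1 is "that" and whose f2 is "ten".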

Irregularities like this can be handled to some extent, but if the schema changes entirely, the data cannot be loaded at all.