What are all the data quality checks we do in our real-time Big Data projects?

+1 vote
What are all the data quality checks we do in our real-time Big Data projects?
Example 1: How can we verify that the count of records loaded into HDFS matches the source?
Example 2: How can we verify that the records loaded into HDFS are correct?
Sep 4, 2019 in Big Data Hadoop by Madhan
• 130 points
1,957 views

1 answer to this question.

+1 vote

You can use a checksum to compare the file at the source with the file uploaded to HDFS. 

Try this (the first command hashes the HDFS copy, the second hashes the source file on the local filesystem): 

$ hdfs dfs -cat /file/in/hdfs | md5sum

$ md5sum /file/at/source

If these two commands print the same hash, the file was not corrupted during the transfer. 
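If you'd rather script the comparison, here is a minimal Python sketch of the same idea. It assumes both copies can be read as local files (e.g. after pulling the HDFS copy down with `hdfs dfs -get`); the paths in the comment are hypothetical:

```python
import hashlib

def file_md5(path, chunk_size=8192):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical paths: the original source file and the copy fetched from HDFS.
# files_match = file_md5("/data/source/file.csv") == file_md5("/tmp/file.csv")
```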

answered Sep 4, 2019 by Tina
Thanks for the help, but I am copying from a MySQL table to HDFS. In that scenario, if one record is corrupted, how can we know that?

I'm not 100% sure, but I think you can compare CRC-based checksums as follows:

For the MySQL table, get the checksum with:

CHECKSUM TABLE <tablename>;

And in HDFS, use this:

hdfs dfs -Ddfs.checksum.combine.mode=COMPOSITE_CRC -checksum /path/to/file

Keep in mind the two values are computed over different representations of the data, so each is mainly useful for detecting changes on its own side rather than for a direct cross-system comparison.
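Neither table-level checksum tells you *which* record went bad. One hedged way to narrow it down is to hash each record on both sides and diff the hashes. The sketch below assumes you can iterate the MySQL rows (e.g. via a SELECT with a stable ORDER BY) and the lines of the HDFS file in the same order, with the same field delimiter; both inputs here are illustrative:

```python
import hashlib

def row_hash(fields, delimiter=","):
    """Hash one source record, joining fields the way the HDFS file stores them."""
    line = delimiter.join(str(f) for f in fields)
    return hashlib.md5(line.encode("utf-8")).hexdigest()

def find_mismatched_rows(source_rows, loaded_lines, delimiter=","):
    """Return 0-based indexes where the source row and the loaded line differ."""
    bad = []
    for i, (row, line) in enumerate(zip(source_rows, loaded_lines)):
        if row_hash(row, delimiter) != hashlib.md5(line.encode("utf-8")).hexdigest():
            bad.append(i)
    # Extra or missing trailing records also count as mismatches.
    shorter = min(len(source_rows), len(loaded_lines))
    extra = abs(len(source_rows) - len(loaded_lines))
    bad.extend(range(shorter, shorter + extra))
    return bad
```

For example, if the source has rows `(1, "a"), (2, "b")` and the loaded file has lines `"1,a", "2,X"`, the function reports index 1 as corrupted.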
