How to fix corrupt files on HDFS

Question

How does someone fix an HDFS that's corrupt? I looked on the Apache/Hadoop website and it said its fsck command, which doesn't fix it. Hopefully, someone who has run into this problem before can tell me how to fix this.

Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally NameNode automatically corrects most of the recoverable failures.

When I ran bin/Hadoop fsck / -delete, it listed the files that were corrupt or missing blocks. How do I make it not corrupt? This is on a practice machine so I COULD blow everything away but when we go live, I won't be able to "fix" it by blowing everything away so I'm trying to figure it out now.

ravikiran · Answer 1 · Jul 18, 2019

1 - Spark if following slave/master architecture. So on your cluster, you have to install a spark master and N spark slaves. You can run spark in a standalone mode. But using Yarn architecture will give you some benefits. There is a very good explanation of it here

2- It is necessary if you want to use Yarn or HDFS for example, but as i said before you can run it in standalone mode.