PySpark is taking the default path

0 votes
from pyspark import SparkFiles
rdd=sc.textFile("emp/employees/part-m-00000")
rdd.map(lambda line: line.upper()).collect()

This code executes with no issues, but my file is present at
/user/edureka_536711/emp/employees/part-m-00000

I am not sure how the path /user/edureka_536711/ is being applied by default, while the code below is failing:

def get_hdfspath(filename):
    my_hdfs = "user/{0}".format(user_id.lower())
    return os.path.join(my_hdfs, filename)

rdd = sc.textFile(sample)
rdd.map(lambda line: line.upper()).collect()

Can you help here?

Jul 16, 2019 in Apache Spark by Will
1,490 views

1 answer to this question.

0 votes

The HDFS home path for MyLab is /user/edureka_id, so that path is used by default even if you do not mention it. For example, if a text file abc.txt is present in that home directory, then /user/edureka_id/abc.txt and plain abc.txt refer to the same file.
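As a minimal sketch of this resolution rule (plain Python, no Spark needed, with a hypothetical lab id): a relative name is joined onto the HDFS home directory, so both spellings point at the same file.

```python
import os

# Assumed MyLab home directory (hypothetical id, for illustration only)
HDFS_HOME = "/user/edureka_536711"

def resolve(path):
    # HDFS resolves a relative path against the user's home directory,
    # much like a shell resolves one against the current directory
    return path if path.startswith("/") else os.path.join(HDFS_HOME, path)

# Both spellings point at the same file
assert resolve("abc.txt") == resolve("/user/edureka_536711/abc.txt")
```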

Regarding the code: it fails because my_hdfs is built without a leading slash, so the result is a relative path that HDFS resolves under your home directory again. A corrected version (assuming user_id holds your lab id, and that sample is meant to be the result of get_hdfspath):

import os

def get_hdfspath(filename):
    # leading slash makes the path absolute on HDFS
    my_hdfs = "/user/{0}".format(user_id.lower())
    return os.path.join(my_hdfs, filename)

sample = get_hdfspath("emp/employees/part-m-00000")
rdd = sc.textFile(sample)
rdd.map(lambda line: line.upper()).collect()


Hope this helps!

Thanks.

answered Jul 16, 2019 by Khushi
