Using PHP to get all file names in a folder stored in HDFS

Question

How can I get the names of all the files in a specific folder in my HDFS? Technically, I should loop over all the files in the folder and collect each file name. Is there a way to do this?
I know I should be using PHP cURL to access WebHDFS, but I can't find suitable code.

For example, I wish to get the below file names from the folder folder11 and store them in PHP variables:

Bhavish · Answer 1 · Mar 13, 2019

So I found a workaround for the above problem; it is basically a different approach.

Instead of uploading files to Hadoop using copyFromLocal, I used PHP cURL. I will try to explain this step by step.
First, create a PHP script and insert the code below:

function call_curl($headers, $method, $url, $data, $file, $size) {
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);
    // Return the response body as a string instead of printing it
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    // Certificate checks are switched off; only do this on a trusted test cluster
    curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($handle, CURLOPT_SAFE_UPLOAD, true);

    switch ($method) {
        case 'GET':
            break;
        case 'POST':
            curl_setopt($handle, CURLOPT_POST, true);
            curl_setopt($handle, CURLOPT_POSTFIELDS, $data);
            break;
        case 'PUT':
            // Stream the open file handle as the raw request body.
            // rewind() makes sure the upload starts at the first byte
            // even if the caller has already read from the handle.
            rewind($file);
            curl_setopt($handle, CURLOPT_UPLOAD, true);
            curl_setopt($handle, CURLOPT_INFILE, $file);
            curl_setopt($handle, CURLOPT_INFILESIZE, $size);
            break;
        case 'DELETE':
            curl_setopt($handle, CURLOPT_CUSTOMREQUEST, 'DELETE');
            break;
    }

    // curl_exec() returns the response body, or false on failure
    $response = curl_exec($handle);
    curl_close($handle);
    return $response;
}

The above function makes a request to Hadoop and performs the operation selected by $method. In our case that will be PUT, since we are uploading to Hadoop.

Now I will define the path to the folder containing the files that I wish to upload.

$dir = '/var/www/html/myData';

Now I will create a foreach loop that goes through all the files, gets each file name, and afterwards collects other details such as file size and file path (this depends on what you need; in my case, I am storing each file's details in my database).

// $username, $date and $time are assumed to be set earlier in the script
foreach (new DirectoryIterator($dir) as $fileInfo) {
    if ($fileInfo->isDot()) continue;
    $filename = $fileInfo->getFilename();

    // Start a PHP cURL session to connect and upload the file to HDFS
    $header = array('Content-Type: application/octet-stream');
    $method = "PUT";

    // Path of zip folder
    $filepath = $dir."/".$filename;

    // Size of zip folder in bytes
    $size = filesize($filepath);

    // Choose the storage folder: up to 1 GB goes to 3-way replication, anything larger to erasure coding
    if ($size <= (1024*1024*1024)) {
        $rep = "3WayReplication";
    } else {
        $rep = "erasure";
    }

    // rawurlencode() keeps file names that contain spaces valid inside the URL
    $url = "http://chbpc-VirtualBox:9864/webhdfs/v1/".$username."/".$rep."/".rawurlencode($filename)."?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false";

    $file = fopen($filepath, 'r');

    // Raw file contents, sent as the request body
    $data = stream_get_contents($file);

    call_curl($header, $method, $url, $data, $file, $size);
    fclose($file);

    // Store file details into database
    $m = new MongoClient();
    $collection = $m->ecoss->fileInfo;
    $document = array(
        "rootFolder" => $username,
        "fileName" => $filename,
        "filePath" => $filepath,
        "fileSize" => ($size/1024)."kb",
        "replicationType" => $rep,
        "uploadDate" => $date,
        "uploadTime" => $time
    );
    $collection->insert($document);
    //echo "Document inserted successfully";

    // Allow permission chmod 777 in root folder for deletion of files
    if (!unlink($filepath)) {
        //echo ("Error deleting ".$filename);
    } else {
        //echo ("Deleted ".$filename);
    }
}

As you can see above, I am using MongoDB to store my file details. This is why I posted the above question: when I was using copyFromLocal, there was no way to get information about the files being uploaded to my HDFS. Using PHP cURL, I have each file name stored in the variable $filename and I am able to store it in my database. You can skip the MongoDB code if you don't need it.
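The folder choice and the path part of the URL from the loop can also be pulled out into small functions. This is only a sketch: the function names are hypothetical, the 1 GB threshold and folder names are the ones used in the loop above, and the use of rawurlencode() for file names with spaces is my own assumption about what your cluster needs.

```php
<?php
// Hypothetical helper: pick the target folder from the file size in bytes,
// using the same 1 GB threshold as the upload loop.
function choose_replication_folder(int $size): string {
    return $size <= 1024 * 1024 * 1024 ? '3WayReplication' : 'erasure';
}

// Hypothetical helper: build the WebHDFS path for a file, percent-encoding
// each segment so that names containing spaces stay valid inside a URL.
function build_hdfs_path(string $username, string $rep, string $filename): string {
    $segments = array_map('rawurlencode', [$username, $rep, $filename]);
    return '/webhdfs/v1/' . implode('/', $segments);
}

// Example: a 2 KB file whose name contains spaces
echo build_hdfs_path('test4', choose_replication_folder(2048), '21 February 2019 11_31_55 PM'), "\n";
```

The host, port and query string (?op=CREATE&...) still have to be added around the returned path, as in the $url line of the loop.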

Now there is something very important: the $url. You can copy and paste the above code, that is absolutely fine, but the URL in your case will be different. To get your own $url, follow the steps below.

First, you need to have curl installed on your machine.
Now open your terminal and type the first curl command (as per the previous hadoop apache link):

curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"

where <HOST> is your hostname, which you can find by typing this in the terminal:

hostname

<PORT> is the port you use to connect to Hadoop and <PATH> is the HDFS path you wish to upload your file to. After issuing this first command, you will get a 307 Temporary Redirect response whose Location header points to a datanode.

After this, issue another curl command:

curl -i -X PUT -T <LOCAL_FILE> "<LOCATION>"

Here <LOCAL_FILE> is the path to the file that you wish to upload to HDFS, and <LOCATION> is the location that you received after issuing the first command. In my case, the location is:

http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false

So the above location is what you need to put in your $url variable.

$url = "http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false";
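The first step can also be done from PHP instead of the terminal. Below is a minimal sketch, not the method I used above: both function names are hypothetical, and $namenodeUrl is a placeholder for your own "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE" address. It sends the PUT without any file data and reads the Location header from the reply.

```php
<?php
// Hypothetical helper: pull the Location header out of the raw response
// headers returned by the first (namenode) request.
function extract_location(string $rawHeaders): ?string {
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive; match only at the line start
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null;
}

// Hypothetical helper: ask the namenode where the file should be uploaded.
// $namenodeUrl is a placeholder for your own op=CREATE URL.
function fetch_upload_location(string $namenodeUrl): ?string {
    $handle = curl_init($namenodeUrl);
    curl_setopt($handle, CURLOPT_CUSTOMREQUEST, 'PUT');
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($handle, CURLOPT_HEADER, true);          // keep headers in the output
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, false); // do not follow the redirect
    $response = curl_exec($handle);
    return $response === false ? null : extract_location($response);
}
```

The value returned by fetch_upload_location() would then play the role of $url for the actual upload.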

Naturally, before testing your PHP script, you can upload a file to HDFS through the terminal to check that everything is fine. Just run the second curl command, which should look something like this after adding the location:

curl -i -X PUT -T /var/www/html/myData/21\ February\ 2019\ 11_31_55\ PM "http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false"

Check in your HDFS whether the file has been uploaded. If it has, then your $url should be good.

The code below deletes each file in my local folder after the upload call to my HDFS.

//Allow permission chmod 777 in root folder for deletion of files
if (!unlink($filepath)) {
  //echo ("Error deleting ".$filename);
} else {
  //echo ("Deleted ".$filename);
}
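Since the local file is deleted right after the upload call, it may be safer to check the result first. The sketch below is an assumption on my side, not part of my script: both helper names are hypothetical, and it presumes you change call_curl so that the HTTP status from curl_getinfo() is also available to the caller. WebHDFS documents 201 Created as the reply to a successful CREATE.

```php
<?php
// Hypothetical helper: decide whether an upload succeeded.
// $response is what curl_exec() returned (false on a transport error),
// $code is the HTTP status reported by curl_getinfo().
function upload_succeeded($response, int $code): bool {
    return $response !== false && $code === 201;
}

// Hypothetical helper: delete the local copy only after a confirmed upload.
function delete_if_uploaded(string $filepath, $response, int $code): bool {
    if (!upload_succeeded($response, $code)) {
        return false; // keep the file so the upload can be retried
    }
    return unlink($filepath);
}
```

With this in place, a failed upload leaves the file in the folder, so running the script again picks it up.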

Before you run the PHP script, delete the files that have been uploaded using the command-line curl (just in case you are uploading the same file with PHP cURL). Now run your script and check whether the file has been uploaded to your HDFS.

Please excuse the long reply; I have tried to give as much detail as possible for this solution since it was quite a struggle for me to make this work.
I hope you can use this as a good reference.

Omkar · Answer 2 · Mar 11, 2019

Hi @Bhavish.

Did some research and found this API for PHP on github: https://github.com/adprofy/Php-Hadoop-Hdfs

I think this part is what you are interested in:

See the link above for documentation on setting up the API for PHP.

Then you can try something like this:

$hdfsDir = '<path to directory whose files you want to list>';
$hdfs->readDir($hdfsDir);

Please see if this works for you.
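If the library does not fit your setup, WebHDFS itself has a LISTSTATUS operation that returns the contents of a directory as JSON. Here is a minimal sketch with plain PHP cURL. The function names are hypothetical, and $baseUrl (something like "http://<HOST>:<PORT>") and $hdfsDir (something like "/folder11") are placeholders for your own cluster; I have not run it against your setup.

```php
<?php
// Turn a LISTSTATUS JSON reply into a plain array of file names.
// The FileStatuses / FileStatus / pathSuffix / type layout follows the
// WebHDFS REST API documentation.
function file_names_from_liststatus(string $json): array {
    $decoded = json_decode($json, true);
    $entries = $decoded['FileStatuses']['FileStatus'] ?? [];
    $names = [];
    foreach ($entries as $entry) {
        if (($entry['type'] ?? '') === 'FILE') { // skip sub-directories
            $names[] = $entry['pathSuffix'];
        }
    }
    return $names;
}

// Hypothetical wrapper: $baseUrl and $hdfsDir are placeholders
// for your own namenode address and HDFS folder.
function list_hdfs_files(string $baseUrl, string $hdfsDir): array {
    $handle = curl_init($baseUrl . '/webhdfs/v1' . $hdfsDir . '?op=LISTSTATUS');
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($handle);
    return $response === false ? [] : file_names_from_liststatus($response);
}
```

Calling list_hdfs_files() with your own address and folder should give you an array you can loop over to store each name in a variable.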

@Karan not sure if that would change anything, but I found a workaround for this. I will post the solution soon. It is something quite different, but it has solved my problem.