I have PDF documents that I want to parse with a MapReduce program. I have already written a Java program that parses PDF files with Apache PDFBox. I load the PDF file like this:
String inputPath = "/home/edureka/pdf/abc.pdf";
File inputFile = new File(inputPath);
PDDocument pdf = PDDocument.load(inputFile);
Now I have to write a MapReduce program that parses the PDF documents. I am planning to use WholeFileInputFormat so that each document is passed to a mapper whole, as a single split.
How can I use PDFBox with SequenceFileInputFormat or WholeFileInputFormat?
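To make the question concrete, here is a minimal sketch of the mapper I have in mind. It rests on several assumptions: WholeFileInputFormat is not part of Hadoop itself, so I assume the common custom version that extends FileInputFormat and emits a NullWritable key with the whole file as a BytesWritable value; I assume the PDFBox 2.x API (PDDocument.load(byte[]) and org.apache.pdfbox.text.PDFTextStripper); and the names PdfTextMapper and extractText are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical mapper: assumes a custom WholeFileInputFormat that emits
// <NullWritable, BytesWritable>, one record per PDF file.
public class PdfTextMapper extends Mapper<NullWritable, BytesWritable, Text, Text> {

    // Parses an in-memory PDF and returns its plain text (PDFBox 2.x API).
    static String extractText(byte[] pdfBytes) throws IOException {
        try (PDDocument pdf = PDDocument.load(pdfBytes)) {
            return new PDFTextStripper().getText(pdf);
        }
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // copyBytes() returns only the valid bytes; getBytes() may include padding.
        String text = extractText(value.copyBytes());

        // Assumes the split is a FileSplit, which holds if the input format
        // extends FileInputFormat. The file name becomes the output key.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        context.write(new Text(fileName), new Text(text));
    }
}
```

In the driver this would be wired up with job.setInputFormatClass(WholeFileInputFormat.class) and job.setMapperClass(PdfTextMapper.class). The point of the design is that PDFBox never sees a local file path, only the bytes that Hadoop hands to the mapper.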