Scenerio: Parse the PDF Bank statement and transform into clean and formatted csv file.
What I've tried: I manage to parse the pdf file(tabular format) using camelot library but failed to produce the desired result in sense of formatting.
Code:
import camelot
import pandas as pd
tables = camelot.read_pdf('test.pdf', pages = '3')
for i, table in enumerate(tables):
print(f'table_id:{i}')
print(f'page:{table.page}')
print(f'coordinates:{table._bbox}')
tables = camelot.read_pdf('test.pdf', flavor='stream', pages = '3')
columns = df.iloc[0]
df.columns = columns
df = df.drop(0)
df.head()
for c in df.select_dtypes('object').columns:
df[c] = df[c].str.replace('$', '')
df[c] = df[c].str.replace('-', '')
def convert_to_float(num):
try:
return float(num.replace(',',''))
except:
return 0
for col in ['Deposits', 'Withdrawals', 'Balance']:
df[col] = df[col].map(convert_to_float)
My_Result:

Desired_Output:

The logic I came up with is to move those rows up i guess n-1 if date column is NaN i don't know if this logic is right or not.Can anyone help me to sort out this properly?
I tried pandas groupby and aggregation functions but it only merging the whole data and removing NaN and duplicate dates which is not suitable because every entry is necessary.