I'm trying to compare worksheets in a lot of different excel files, however it's a tough problem. The issue appears to be as follows:
I have a directory with roughly 1000.xlsx files, each with 16 worksheets and dated sequentially. Each file contains a forecast of the numbers for a specific day using data from four subsequent days plus, if necessary, the previous week's Saturday.
For instance, the filename for the first five worksheets in it would be ['10-12','13-12','14-12','15-12','16-12'] if I were to predict the 17th of December 2022. The 11th and 12th are omitted because the 10th falls on a Saturday.
Lets call this file 1.
This file will be called xxx 1218.xlsx and contain the following 5 sheets: "17-12," "13-12," "14-12," "15-12," and "16-12." The following file in the folder will be the following day, the 18th of December 2022, and contains 4 of the same days plus a new one, which is the actual result (not the day we were predicting) that we were trying to predict. The 17th is a Saturday, therefore the order is irrelevant.
Lets call this file 2.
The challenge now is. I want to make sure that all of the worksheets in file 1, which have the same dates as file 2, have the same information. Make the same comparison after that by iteratively going through each subsequent file. Hence, after comparing the worksheets from files 1 and 2, file 2's worksheets should be compared to files 3 and so on.
I currently have the following, which performs a number of complex acrobatics to obtain the names of the worksheets from the directory, a list of the pertinent ones from the dictionary keys, and then calls them in dataframes to compare them.
import pandas as pd
import os
from datetime import date
path_root =r'C:\Users\Files'
filenames = [file for file in os.listdir(path_root) if file.endswith('.xlsx')]
keys_list = []
df1_keys_list = []
df2_keys_list = []
for i in range(len(filenames)):
filename = filenames[i]
filename_compare = filenames[i+1]
day = filename[-7:-5]
month = filename[-9:-7]
day_compare = filename_compare[-7:-5]
month_compare = filename_compare[-9:-7]
filename_compare = 'prealign_debug_'+month_compare+day_compare+'.xlsx'
Path_1 = os.path.join(path_root,filename)
Path_2 = os.path.join(path_root,filename_compare)
df1 = pd.read_excel(Path_1, sheet_name=None)
df2 = pd.read_excel(Path_2, sheet_name=None)
for j in df1.keys():
df1_keys_list.append(j)
df1_days = df1_keys_list[0:5]
for k in df2.keys():
df2_keys_list.append(k)
df2_days = df2_keys_list[0:5]
df3 = pd.read_excel(Path_1, sheet_name=df1_days[1])
df4 = pd.read_excel(Path_1, sheet_name=df1_days[2])
df5 = pd.read_excel(Path_1, sheet_name=df1_days[3])
df6 = pd.read_excel(Path_2, sheet_name=df1_days[0])
df7 = pd.read_excel(Path_2, sheet_name=df1_days[1])
df8 = pd.read_excel(Path_2, sheet_name=df1_days[2])
[print('correct') if df3.equals(df6) else print('incorrect')]
[print('correct') if df4.equals(df7) else print('incorrect')]
[print('correct') if df5.equals(df8) else print('incorrect')]
This is a bit cumbersome, probably there's a slicker way of doing it but this is what I have anyway.
Two problems arise, one: Something is wrong with my loop structure here
for j in df1.keys():
df1_keys_list.append(j)
df1_days = df1_keys_list[0:5]
as 95 instead of 5 elements should have been in the list df1 keys list. There might be a more effective approach to create the keys list that references the worksheets.
Second, I can tell it's not functioning since I'm comparing 5 of the exact same files, and when I make a change to one of them, the script still prints a "right" statement, EXCEPT when I make a change to the second file.
How can I tidy this up and what am I doing wrong?