How I imported many Excel files into Python and then exported one text file
At my previous job, some clients for projects provided 90 to 100 Excel files that I needed to import into SQL Server or somewhere else. To do that I wanted to create one text file that could be imported into something else. In this case it required a little more work because the files had dates and times in their names but the date wasn’t in the data. I wanted to keep the date though, so I added a field for each file that contained the date and time part of the filename. I thought Python would be good for everything I wanted to do and it was. Here’s what I did.
First I imported the necessary libraries.
# Imports import os import pandas as pd import glob
Then I changed to the directory that contained the workbooks and put all of the filenames in a list.
# Change directory os.chdir("E:/Data/Client1/Excel Files") # Create a list with all the files path = os.getcwd() files = os.listdir(path)
I used list comprehension to get all the Excel workbooks. I could have used glob but wanted to use list comprehension for practice
# Select only xlsx files files_xlsx = [f for f in files if f[-4:] == "xlsx"]
I initialized an empty dataframe then imported the into the dataframe. I added the date and time as a field in this step, also replacing spaces and symbols in the column names. Then I exported each as a pipe delimited text file, with the filename being the date and time and the text files having no header or index.
# Initialize empty dataframe df = pd.DataFrame() # Loop over list of Excel files, import into dataframe, add date field, and export for f in files_xlsx: df = pd.read_excel(f, skiprows=4, skipfooter=3) # Add the date and time as a field df['file'] = f[-15:-5] # Replace spaces and symbols in column names df.columns = [c.replace(' ', '_') for c in df.columns] df.columns = [c.replace('#', '') for c in df.columns] col_names = df.columns # Export each as a text file df.to_csv(f[-15:-5]+".txt", sep="|", index=None, header=None)
I used glob to put the names of all the text files in a list.
# Get the .txt files files_txt = glob.glob("*.txt")
I initialized a new dataframe to hold all the data as well as an empty list.
# Initialize the dataframe to hold all the data df_full = pd.DataFrame() # Initialize an empty list df_list = 
Then I imported the files from file_txt, appending the data to the list. Then I added the data in the list to the dataframe using the pandas concat function.
# Loop over list of text files and import the data, then append the data to a list for f in files_txt: data = pd.read_csv(f, sep="|", header=None) df_list.append(data) # Add the data from the list to the dataframe for the full data set df_full = pd.concat(df_list, axis=0)
Since column names aren’t in the text files, I created a list that has them and then added those (the names have been changed except for file; that’s the one I added based on the Excel filename).
# Create the list with the column names col_names_full = ['Field1', 'Field2', 'Field3', 'Field4', 'Field5', 'Field6', 'Field7', 'Field8', 'Field9', 'Field10', 'Field11', 'Field12', 'Field13', 'file'] # Add the column names to the dataframe df_full.columns = col_names_full
Field4 has some NaN values that need to be replaced
# Replace NaN in Field4 with "None" df_full["Field4"] = df_full.Field4.fillna("None")
I created a dataframe that only contained a specific value from Field4, RSV.
# Create RSV only dataframe df_RSV = df_full[df_full.Field4 == "RSV"]
I created two outfile variables, one for the full dataframe and one for the RSV data only. Then I exported both as text delimited files.
# Outfile for RSV only outfile_RSV = "Data_RSV_20180312.txt" # Outfile for all data outfile_full = "Data_All_20180312.txt" # Export RSV as pipe delimited text df_RSV.to_csv(outfile_RSV, sep="|", index=None) # Export full as pipe delimited text df_full.to_csv(outfile_full, sep="|", index=None)
This produced two clean pipe delimited text files that could easily be imported elsewhere.