Table of Contents
Introduction
For my Eastern University capstone, I had a few project ideas. One of them involved analyzing IRS XML data to predict nonprofit failure. Although I didn’t end up going with that idea (because so many of the records contain data quality issues), I did end up writing a number of data cleaning functions to help me process the 1.8 million XML records that I had downloaded. Some of these will be useful to others working on similar problems.
The Function
One of the functions I wrote was designed to identify Form 990s and move them to another folder. From there, I would use a second function to find the 501(c)(3)’s. It’s printed in full below and then I go through it line by line.
import pandas as pd import os import time def rename990Files(path_to_files, new_path): #Create a list of files file_list = os.listdir('r' + path_to_files) #start counting start_time = time.perf_counter() for i in file_list: #Read in the xml file df = pd.read_xml(file_list + i) #Filter for just the column containing ReturnType or ReturnTypeCd df = df.loc[:,df.columns.str.contains('ReturnType')] #If it's a 990, rename it to have _public_990 on the end if df.iloc[0][0] == 990: i_stripped = i.rstrip("_public.xml") new_name = i_stripped + '_public_990.xml' file_path = new_path os.rename(file_path + i,file_path + new_name) #Do nothing #stop counting end_time = time.perf_counter() print(f"Time {end_time - start_time:0.4f} seconds")
Understanding the Function
We’ll walk through it line by line so you can understand it.
import pandas as pd import os import time
First we import the libraries that we need: pandas to do our data manipulation, os so we can find the right file path, and time so we can count how long this activity will take. This first pass took about 8 hours and I had other activities take 12+.
def fixFiles(path_to_files, new_path): #Create a list of files file_list = os.listdir('r' + path_to_files)
Next we create our function, with 2 arguments: the path to the files and the new path (where we want to save our files.) Because of the particular activity, I wanted to store the files in a different location.
We use os.listdir to create a list of files in the folder. The “R” is used to pass in a raw string, that means that if the files have spaces in them this function won’t throw an error.
#start counting start_time = time.perf_counter()
I’m using the time library to build a performance counter. By setting the start time here and stop time later, I can then print out the difference and know how long the task took.
for i in file_list[-1]: #Read in the xml file df = pd.read_xml(file_list + i) #Filter for just the column containing ReturnType or ReturnTypeCd df = df.loc[:,df.columns.str.contains('ReturnType')] #If it's a 990, rename it to have _public_990 on the end if df.iloc[0][0] == 990: i_stripped = i.rstrip("_public.xml") new_name = i_stripped + '_public_990.xml' file_path = new_path os.rename(file_path + i,file_path + new_name) #stop counting end_time = time.perf_counter() print(f"Time {end_time - start_time:0.4f} seconds")
This is a big block. Let’s move through it line by line. for i in file_list[-1] iterates through our file list except for the last entry, which is actually the name of the folder itself. It will throw an error if you try to open the folder like it’s a file.
The next line df = pd.read_xml(file_list + i) reads our XML file into a dataframe. file_list is the path we gave earlier (with double backslashes like C:\\Users\\Dustin\\xml_files\\ and i is the name of the specific file we’re opening like 201513099349201001.xml.
So, we read it into a dataframe. Then we need to find just the ReturnType column. We use df.loc to filter the dataframe for columns that contain “ReturnType” since some of them are named ReturnType and some are named ReturnTypeCd but they both have the info that we need.
#If it's a 990, rename it to have _public_990 on the end if df.iloc[0][0] == 990: i_stripped = i.rstrip("_public.xml") new_name = i_stripped + '_public_990.xml' file_path = new_path os.rename(file_path + i,file_path + new_name)
Next, if the return type is 990 (which is what we want), we generate a new file name by stripping the .xml off the end and replacing it with _public_990.xml. We then rename the file with our file name.
Outside the loop we can calculate and print our end-time:
end_time = time.perf_counter() print(f"Time {end_time - start_time:0.4f} seconds")
And that’s it!
Conclusion
These pandas and os are really useful for any bulk action that you might need to do. Have you used pandas in this way before or have any interesting data cleaning stories?