Dustin K MacDonald

Menu
  • Home
  • About
  • Economic and Community Development
  • Nonprofit Management
    • Counselling and Service Delivery
    • Suicide Prevention / Crisis Intervention
  • Politics and Governance
  • Math and Statistics
  • Salesforce
Menu

Using Pandas to Iterate Through a List of Files

Posted on March 11, 2022March 11, 2022 by Dustin

Table of Contents

Introduction

For my Eastern University capstone, I had a few project ideas. One of them involved analyzing IRS XML data to predict nonprofit failure. Although I didn’t end up going with that idea (because so many of the records contain data quality issues), I did end up writing a number of data cleaning functions to help me process the 1.8 million XML records that I had downloaded. Some of these will be useful to others working on similar problems.

The Function

One of the functions I wrote was designed to identify Form 990s and move them to another folder. From there, I would use a second function to find the 501(c)(3)’s. It’s printed in full below and then I go through it line by line.

import pandas as pd
import os
import time

def rename990Files(path_to_files, new_path):
    #Create a list of files
    file_list = os.listdir('r' + path_to_files)

    #start counting
    start_time = time.perf_counter()
    for i in file_list:
        #Read in the xml file
        df = pd.read_xml(file_list + i)
        #Filter for just the column containing ReturnType or ReturnTypeCd
        df = df.loc[:,df.columns.str.contains('ReturnType')]
        #If it's a 990, rename it to have _public_990 on the end
        if df.iloc[0][0] == 990:
            i_stripped = i.rstrip("_public.xml")
            new_name = i_stripped + '_public_990.xml'
            file_path = new_path
            os.rename(file_path + i,file_path + new_name)
            #Do nothing
    #stop counting
    end_time = time.perf_counter()
    print(f"Time {end_time - start_time:0.4f} seconds") 

Understanding the Function

We’ll walk through it line by line so you can understand it.

import pandas as pd
import os
import time

First we import the libraries that we need: pandas to do our data manipulation, os so we can find the right file path, and time so we can count how long this activity will take. This first pass took about 8 hours and I had other activities take 12+.

def fixFiles(path_to_files, new_path):
    #Create a list of files
    file_list = os.listdir('r' + path_to_files)

Next we create our function, with 2 arguments: the path to the files and the new path (where we want to save our files.) Because of the particular activity, I wanted to store the files in a different location.

We use os.listdir to create a list of files in the folder. The “R” is used to pass in a raw string, that means that if the files have spaces in them this function won’t throw an error.

#start counting
    start_time = time.perf_counter()

I’m using the time library to build a performance counter. By setting the start time here and stop time later, I can then print out the difference and know how long the task took.

   for i in file_list[-1]:
        #Read in the xml file
        df = pd.read_xml(file_list + i)
        #Filter for just the column containing ReturnType or ReturnTypeCd
        df = df.loc[:,df.columns.str.contains('ReturnType')]
        #If it's a 990, rename it to have _public_990 on the end
        if df.iloc[0][0] == 990:
            i_stripped = i.rstrip("_public.xml")
            new_name = i_stripped + '_public_990.xml'
            file_path = new_path
            os.rename(file_path + i,file_path + new_name)
    #stop counting
    end_time = time.perf_counter()
    print(f"Time {end_time - start_time:0.4f} seconds") 

This is a big block. Let’s move through it line by line. for i in file_list[-1] iterates through our file list except for the last entry, which is actually the name of the folder itself. It will throw an error if you try to open the folder like it’s a file.

The next line df = pd.read_xml(file_list + i) reads our XML file into a dataframe. file_list is the path we gave earlier (with double backslashes like C:\\Users\\Dustin\\xml_files\\ and i is the name of the specific file we’re opening like 201513099349201001.xml.

So, we read it into a dataframe. Then we need to find just the ReturnType column. We use df.loc to filter the dataframe for columns that contain “ReturnType” since some of them are named ReturnType and some are named ReturnTypeCd but they both have the info that we need.

#If it's a 990, rename it to have _public_990 on the end
        if df.iloc[0][0] == 990:
            i_stripped = i.rstrip("_public.xml")
            new_name = i_stripped + '_public_990.xml'
            file_path = new_path
            os.rename(file_path + i,file_path + new_name)

Next, if the return type is 990 (which is what we want), we generate a new file name by stripping the .xml off the end and replacing it with _public_990.xml. We then rename the file with our file name.

Outside the loop we can calculate and print our end-time:

  end_time = time.perf_counter()
    print(f"Time {end_time - start_time:0.4f} seconds") 

And that’s it!

Conclusion

These pandas and os are really useful for any bulk action that you might need to do. Have you used pandas in this way before or have any interesting data cleaning stories?

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Privacy Policy

See here for our privacy policy. This site uses affiliate links and Adsense ads to provide targeted advertising.

Tags

analytical technique assessment communication community development counselling crisis chat crisis intervention data science eastern university economic development education empathy evaluation forecasting fundraising governance information technology intelligence intelligence analysis keokuk county language learning legal management peer support personal development phd politics professional development protective factors psychosocial risk factors safety planning salesforce sigourney social media statistics suicide suicide assessment suicide risk assessment technology terrorism training university of the cumberlands violence risk assessment youth

Recommended Posts

  • Conducting Psychosocial Assessments
  • DCIB Model of Suicide Risk Assessment
  • ABC Model of Crisis Intervention
  • My Friend is Suicidal - What do I do?

Recent Posts

  • ITS833 Information Governance
  • Enhanced Care Management (ECM) with Salesforce
  • ITS835 Enterprise Risk Management
  • Glorifind Christian Search Engine
  • Sigourney Iowa Election Results, 2023

Archives

  • November 2023 (6)
  • October 2023 (1)
  • September 2023 (3)
  • August 2023 (1)
  • July 2023 (1)
  • May 2023 (1)
  • March 2023 (1)
  • February 2023 (2)
  • January 2023 (4)
  • December 2022 (2)
  • May 2022 (1)
  • April 2022 (2)
  • March 2022 (1)
  • February 2022 (1)
  • December 2021 (1)
  • October 2021 (1)
  • August 2021 (2)
  • May 2021 (3)
  • December 2020 (1)
  • November 2020 (4)
  • July 2020 (1)
  • June 2020 (1)
  • April 2020 (1)
  • March 2020 (4)
  • February 2020 (7)
  • January 2020 (1)
  • November 2019 (1)
  • October 2019 (2)
  • September 2019 (4)
  • August 2019 (2)
  • March 2019 (1)
  • February 2019 (1)
  • January 2019 (1)
  • December 2018 (4)
  • November 2018 (3)
  • October 2018 (3)
  • September 2018 (19)
  • October 2017 (2)
  • September 2017 (2)
  • August 2017 (1)
  • July 2017 (39)
  • May 2017 (3)
  • April 2017 (4)
  • March 2017 (4)
  • February 2017 (4)
  • January 2017 (5)
  • December 2016 (4)
  • November 2016 (4)
  • October 2016 (5)
  • September 2016 (4)
  • August 2016 (5)
  • July 2016 (5)
  • June 2016 (5)
  • May 2016 (3)
  • April 2016 (2)
  • March 2016 (2)
  • February 2016 (2)
  • January 2016 (4)
  • December 2015 (2)
  • November 2015 (2)
  • October 2015 (2)
  • September 2015 (2)
  • August 2015 (1)
  • June 2015 (2)
  • May 2015 (5)
  • April 2015 (3)
  • March 2015 (8)
  • February 2015 (12)
  • January 2015 (28)

Meta

  • Log in
  • Entries feed
  • Comments feed
  • WordPress.org

Recent Comments

  • Dustin on Starting a Crisis Line or Hotline
  • HAPPINESSHEALTHCOURAGE, LLC on Starting a Crisis Line or Hotline
  • ITS833 Information Governance - Dustin K MacDonald on University of the Cumberlands PhD in Information Technology
  • Elected Officials in Sigourney, Iowa - Dustin K MacDonald on Sigourney Iowa Election Results, 2023
  • ITS 835 Enterprise Risk Management - Dustin K MacDonald on University of the Cumberlands PhD in Information Technology

Tags

analytical technique assessment communication community development counselling crisis chat crisis intervention data science eastern university economic development education empathy evaluation forecasting fundraising governance information technology intelligence intelligence analysis keokuk county language learning legal management peer support personal development phd politics professional development protective factors psychosocial risk factors safety planning salesforce sigourney social media statistics suicide suicide assessment suicide risk assessment technology terrorism training university of the cumberlands violence risk assessment youth
© 2023 Dustin K MacDonald | Powered by Minimalist Blog WordPress Theme