November 17, 2023

Data Transformation with GPT-4 (Preppin Data, 2023, Week 31)

 

Environment:  ChatGPT Plus, GPT-4


Data:  Preppin Data, 2023, Week 31


Objective:  Use GPT-4 to transform data and fill in the missing IDs in file ee_dim_input.csv.



This data set was selected because it deals with HR data, and HR data is almost always complex.  The requirement is to fill in the missing IDs.  The data set has 2 CSV files:  (1) ee_dim_input.csv contains the list of employees and (2) ee_monthly_input.csv is a monthly snapshot of employees who worked during the month.



1)  Upload Data:  upload csv files.


The next 3 steps remove duplicate rows.



2)  Remove Duplicates:  remove rows that repeat the same ‘employee_id’ in file ee_monthly_input.csv and use the output as a lookup table.  The output file is named ee_monthly_input_cleaned.csv.
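
For readers who want to reproduce this step outside ChatGPT, here is a minimal pandas sketch of the de-duplication. It is not the code GPT-4 generated in the session. The rows and the ‘month’ column are invented for illustration; only the ‘employee_id’ and ‘guid’ field names come from the data set.

```python
import pandas as pd

# Toy stand-in for ee_monthly_input.csv; these rows are invented for illustration.
monthly = pd.DataFrame({
    "employee_id": ["E1", "E1", "E2", "E3", "E3"],
    "guid": ["g-1", "g-1", "g-2", "g-3", "g-3"],
    "month": ["2023-01", "2023-02", "2023-01", "2023-01", "2023-02"],  # hypothetical column
})

# Keep the first row seen for each employee_id so the result can act as a lookup table.
lookup = (
    monthly.drop_duplicates(subset="employee_id", keep="first")
    .reset_index(drop=True)
)

print(lookup)
```

With the real file, the frame would come from pd.read_csv("ee_monthly_input.csv") and the result could be saved with lookup.to_csv("ee_monthly_input_cleaned.csv", index=False).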





3)  Double-Check for Duplicates:  double-check for duplicates in field ‘employee_id’ in file ee_monthly_input_cleaned.csv.  The result confirmed that there are no duplicates.


Double-check for duplicates in field ‘guid’ in file ee_monthly_input_cleaned.csv.  The result confirmed that there is one duplicate.
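
The same two checks can be sketched in pandas as follows. The helper name duplicate_rows and the sample rows are assumptions made for this example. Missing values are skipped so that two blank IDs are not reported as duplicates of each other.

```python
import pandas as pd

# Toy stand-in for ee_monthly_input_cleaned.csv; these rows are invented for illustration.
cleaned = pd.DataFrame({
    "employee_id": ["E1", "E2", "E3", None],
    "guid": ["g-1", "g-2", "g-3", "g-3"],
})

def duplicate_rows(df, column):
    """Return every row whose non-missing value in `column` appears more than once."""
    present = df[df[column].notna()]
    return present[present.duplicated(subset=column, keep=False)]

id_dupes = duplicate_rows(cleaned, "employee_id")   # no rows: every employee_id is unique
guid_dupes = duplicate_rows(cleaned, "guid")        # both rows that share guid "g-3"

print(len(id_dupes), len(guid_dupes))
```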




4)  Remove Duplicates:  remove the row with the duplicated ‘guid’.  The output file is named ee_monthly_input_cleaned_updated.csv.


The next 2 steps fill in the missing values in fields ‘employee_id’ and ‘guid’.



5)  Fill In Field ‘guid’:  link the 2 files together by field ‘employee_id’ and fill in the missing values for field ‘guid’.  The output file is named ee_dim_input_updated_guid_linked_to_cleaned.csv.
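
One way to express this step in pandas is to build an ‘employee_id’ to ‘guid’ mapping from the lookup table and use it only where ‘guid’ is blank. This is a sketch under assumptions, with invented sample rows, and not necessarily what GPT-4 ran.

```python
import pandas as pd

# Toy stand-ins for the employee list and the cleaned lookup table (invented rows).
dim = pd.DataFrame({
    "employee_id": ["E1", "E2", None],
    "guid": [None, "g-2", "g-3"],
})
lookup = pd.DataFrame({
    "employee_id": ["E1", "E2", "E3"],
    "guid": ["g-1", "g-2", "g-3"],
})

# employee_id -> guid mapping; lookup rows missing either field cannot help, so drop them.
id_to_guid = (
    lookup.dropna(subset=["employee_id", "guid"])
    .set_index("employee_id")["guid"]
)

# Fill only the blanks; guid values that are already present are left untouched.
dim["guid"] = dim["guid"].fillna(dim["employee_id"].map(id_to_guid))

print(dim)
```

The mapping needs unique ‘employee_id’ values in the lookup table, which is why the duplicate rows were removed first.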






6)  Fill In Field ‘employee_id’:  link the 2 files together by field ‘guid’ and fill in the missing values for field ‘employee_id’.  The output file is named ee_dim_input_updated_employee_id_linked_to_cleaned.csv.  This file is the final result.
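
The reverse fill follows the same pattern with the roles of the two fields swapped. The sketch below wraps it in a small helper, fill_from_lookup, which is a hypothetical name for this example, and then counts how many IDs are still missing. The sample rows are invented.

```python
import pandas as pd

def fill_from_lookup(df, lookup, key, target):
    """Fill missing `target` values in df by matching `key` against the lookup table."""
    mapping = (
        lookup.dropna(subset=[key, target])
        .drop_duplicates(subset=key)      # the mapping needs one row per key
        .set_index(key)[target]
    )
    out = df.copy()
    out[target] = out[target].fillna(out[key].map(mapping))
    return out

# Toy stand-ins (invented rows): guid is already complete, employee_id has gaps.
dim = pd.DataFrame({
    "employee_id": ["E1", None, None],
    "guid": ["g-1", "g-2", "g-3"],
})
lookup = pd.DataFrame({
    "employee_id": ["E1", "E2", "E3"],
    "guid": ["g-1", "g-2", "g-3"],
})

final = fill_from_lookup(dim, lookup, key="guid", target="employee_id")

# Quick completeness check: how many employee_id values are still missing?
remaining = int(final["employee_id"].isna().sum())
print(final, remaining)
```

A count above zero on the real data would mean that some rows have no match in the lookup table and need a closer look.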







Verdict:  It took me more than 1 hour to transform this data set.  The challenge was knowing how to write the prompts so GPT-4 would do what was needed.  The final result met the requirement.


GPT-4 is remarkable because it can transform data much like Tableau Prep or Alteryx.  The billion-dollar question is how to integrate GPT-4 into corporate IT systems so that it can connect to databases, work with millions of records, and refresh data on a schedule.

