Recent Posts

data cleaning Tag

Cleaning Contact Information with MySQL

11:03 19 May in Contact Management, Data Cleaning, MySQL, PHP, REGEX, Thomas' Toolkit by Thomas

Last week I came in to bat clean up on a large data project.  It was a long lead up before it got to my desk.  A database of some 30 thousand contacts was verified by a group of individuals who actually took the time to call and email every contact to see if their information was still accurate, a truly herculean effort.  But, in the end, because there were many people working on the project, there was a wide array of inconsistencies in files and the way they kept their data.  In the autopsy phase of the project, we discovered ways to prevent that sort of problem, but no sense wasting time on 'woulda, coulda, shoulda', and no looking back now.  I had to clean what I was provided. To start, I uploaded all the contacts to a new MySQL table.  This required that all data from multiple sources get homogenized into a single CSV file to normalize all field data.  There after, I like to set all known bad entries to NULL. Set as Null: update table_to_clean set website = NULL WHERE (website = 'no' OR website = '' OR website = 'n/a') The website field had several different values, depending on who made the entry.  I did this for the other fields as well. The email address field was the most important, as this list is going to drive electronic contact going forward.  I found this expression that does a good job of identifying a proper email address format: '^[A-Z0-9._%-]+@[A-Z0-9.-]+.[A-Z]{2,4}$' First, to find all bad email addresses: SELECT...