Recent Posts
Tags
 

Cleaning Contact Information with MySQL

Cleaning Contact Information with MySQL

11:03 19 May in Contact Management, Data Cleaning, MySQL, PHP, REGEX, Thomas' Toolkit
0 Comments

Last week I came in to bat clean up on a large data project.  It was a long lead up before it got to my desk.  A database of some 30 thousand contacts was verified by a group of individuals who actually took the time to call and email every contact to see if their information was still accurate, a truly herculean effort.  But, in the end, because there were many people working on the project, there was a wide array of inconsistencies in files and the way they kept their data.  In the autopsy phase of the project, we discovered ways to prevent that sort of problem, but no sense wasting time on ‘woulda, coulda, shoulda’, and no looking back now.  I had to clean what I was provided.

To start, I uploaded all the contacts to a new MySQL table.  This required that all data from multiple sources get homogenized into a single CSV file to normalize all field data.  There after, I like to set all known bad entries to NULL.

Set as Null:

update
table_to_clean
set website = NULL
WHERE (website = 'no' OR website = '' OR website = 'n/a')

The website field had several different values, depending on who made the entry.  I did this for the other fields as well.

The email address field was the most important, as this list is going to drive electronic contact going forward.  I found this expression that does a good job of identifying a proper email address format: ‘^[A-Z0-9._%-]+@[A-Z0-9.-]+.[A-Z]{2,4}$’

First, to find all bad email addresses:

SELECT * FROM
 table_to_clean 
WHERE 
email NOT REGEXP '^[A-Z0-9._%-]+@[A-Z0-9.-]+.[A-Z]{2,4}$';

After setting all empties and ‘n/a’s to NULL, I created a new ‘problem’ column to mark a bad email address.  This let me and another person revisit them later for another check.

UPDATE
table_to_clean 
SET problem = 'bad email address - '
WHERE 
email IS NOT NULL
AND
email NOT REGEXP '^[A-Z0-9._%-]+@[A-Z0-9.-]+.[A-Z]{2,4}$';

Upon further inspection, I found many bad emails where the result of leading and trailing whitespace characters. To clean whitespace, new lines, and returns:

UPDATE 
table_to_clean 
SET
email = trim(REPLACE(REPLACE(email,'r',''),'n',''))

TRIM() will knock of whitespace at the begining and end of the field, where the REPLACE() function replaced the r and n characters with an empty string.

I needed a final report on all unique and good email addresses, so I used GROUP BY `email ` as the last condition:

SELECT * FROM 
table_to_clean 
WHERE 
email REGEXP '^[A-Z0-9._%-]+@[A-Z0-9.-]+.[A-Z]{2,4}$'
GROUP BY email

Verifying the url format of the website addresses had problems similar to the email field. I NULLed all known empties to start, just like the email field.  A good query to check for basic proper url formatting might be:

SELECT * FROM
table_to_clean
WHERE website NOT REGEXP "^(https?://|www\.)[.A-Za-z0-9-]+\.[a-zA-Z]{2,4}"

One problem I had with website fields beyond whitespace, was bad leading a trailing characters. Someone in a few fields entered ‘.somedomain.com’, or ‘www.somedomain.com/index.html/’, or ‘www.somedomain.com.’. Using the TRIM function with either LEADING or TRAILING functions helped here.

UPDATE
table_to_clean
SET website = TRIM(LEADING '.' FROM website)
UPDATE
table_to_clean
SET website = TRIM(TRAILING '/' FROM website)

Now, people enter urls in all kids of ways. Some examples we have here are: ‘http://www.domain.com’, ‘http://domain.com’, ‘www.domain.com’, ‘domain.com’. Its the last 2 that I wanted to clean up, adding ‘http://’, so I used:

UPDATE
table_to_clean 
SET
website = concat("http://", website)
WHERE website NOT REGEXP "^(https?://|www\.)[.A-Za-z0-9-]+\.[a-zA-Z]{2,4}"
AND website REGEXP "^[.A-Za-z0-9-]+\.[a-zA-Z]{2,4}"

This added ‘http://’ to the front of the website entry via CONCAT(). I don’t need a ‘www’ (mostly…).

Another common problem with this data, and I’m sure data like it, is people entering email addresses where they mean urls, and urls where they meant email, so checking each field using the other’s regex pattern is advised.

Towards the end, I had to clean the other contact information: names, company names, address 1, address 2, and city. I did this in PHP, because it ended up being a lot quicker for me. (Mileage may vary.)

For phone and fax numbers, I used this function:

function fixPhone($string){
  $pattern = '/D*(?(d{3})?)?D*(d{3})D*(d{4})D*(d{1,8})?/';
  if (preg_match($pattern, $string, $match)){
    if ($match[3]){
      if ($match[1]){
        $num = $match[1].'-'.$match[2].'-'.$match[3];
      }else{
        $num = $match[2].'-'.$match[3];
      }
    }else{
      $num = NULL;
    }
    $match[4] ? $ext = $match[4] : $ext = NULL;
  }else{
    $num = NULL;
    $ext = NULL;
  }
  return $num;
}

An easy half of the contact information was entered all in CAPS. For Proper Casing on names for the contact, company, address and city fields, I used something like this:

//this turns 'THOMAS STUART HALL' into 'Thomas Stuart Hall'
//1. break an entry into an array by an empty space
$contactParts = explode(' ', $contactname); 
//2. create a new empty array for output
$newContactParts = array(); 
//3. loop through each name part
foreach($contactParts as $part){  
//4. convert all parts to lowercase, 
//   uppercase the first character, and add to output array
$newContactParts[] = ucfirst(strtolower($part));
}
//5. implode the pieces together with an empty space,
//   then a quick mysql_real_escape_string to sanitize
$newContact = mysql_real_escape_string(implode(' ', $newContactParts)); 

I also used the strtolower() function on the email and website fields as well, just to keep things even. Using PHP allowed me to do the final formatting all in one shot.

In the end, after running all these scripts, and a few more, we ended up with much cleaner data than we started with. This data is now ready for import into our commercial Customer Management System, which will be a whole other effort that will put this months long project to bed.

Now, for reference, here are the 3 most important regex patterns I used.
Email: ‘^[A-Z0-9._%-]+@[A-Z0-9.-]+.[A-Z]{2,4}$’

URL: ‘^(https?://|www\.)[.A-Za-z0-9-]+\.[a-zA-Z]{2,4}’

Phone and Fax (for PHP preg_match): ‘/D*(?(d{3})?)?D*(d{3})D*(d{4})D*(d{1,8})?/’

I hope this helps someone else.  It took me a few days to pull all these things together and add them to my toolbox.  Its my wish that someone else make use of it to save time. 😉

No Comments

Post A Comment

%d bloggers like this: