Splitting Strings: How to Work with Consistent Data in SQL Server
How do you provide a consistent pattern of data when it's in different formats? Today, we write a Split function to make data consistent in our tables.
If you're a developer, you always find sanctuary in code, but there are times when you're presented with data in an unusual format and things can get...strange.
For example, I'm working with a client where they provide me with an Excel spreadsheet of rows and want to import the data into SQL Server.
So until I get these maintenance screens in place, I continue to import the data. I already wrote one of the data screens based on a previous post I wrote called Import a Data Hierarchy From Excel into SQL Server.
Now I know what you're thinking. Why not use DTS (Data Transformation Services) of SQL Server?
When importing the data, you have an option to write a SQL Statement or define the columns and their sizes. While this provides a robust front-end for importing data, there are more inconsistencies to this particular approach.
I started down this path and found it way too time-consuming so I decided to look for another way.
Back to the spreadsheet.
In this Excel spreadsheet, the data wasn't consistent at all.
- Names are in one cell ("John Doe") instead of in separate cells (First Name, John, Last Name Doe).
- Phone Numbers are in different formats ('(999) 999-9999', '999-999-9999', '999.999.9999', '(999)999-9999 Ext.99', etc.)
Since we want our initial load of data to be consistent in the database, we need to "massage" the data into a usable format.
In today's post, since I've been wrestling with inconsistent data this weekend, I'll show a quick and simple splitter to help with manipulating strings which will save you hours of work.
Creating the Function
Since this is just for an initial load of data, this code is not meant for a production environment.
However, I've been using this function for a long time. It's one of the SQL Server functions I use in my arsenal and I find it quite useful.
The function, when created, is stored in the Table-valued Functions (<database> > Programmability > Functions > Table-valued Functions) and is always available.
CREATE FUNCTION [dbo].[Split] ( @delimited NVARCHAR(MAX), @delimiter NVARCHAR(100)) RETURNS @t TABLE ( id INT IDENTITY (1, 1) ,val NVARCHAR(MAX) ) AS BEGIN DECLARE @xml XML SET @xml = N'<t>' + REPLACE(@delimited, @delimiter, '</t><t>') + '</t>' INSERT INTO @t (val) SELECT r.value('.', 'varchar(MAX)') AS item FROM @xml.nodes('/t') AS records (r) RETURN END
Very quick and easy. The dbo.Split function works like a table.
SELECT * FROM dbo.Split('John Doe',' ')
Execute this statement and you receive the following output.
As you can see, this makes splitting names very easy. If you want to save the values into the fields FirstName and LastName, it's as easy as:
DECLARE @FirstName VARCHAR(100); DECLARE @LastName VARCHAR(100); SELECT @FirstName=val FROM dbo.Split(@Person,' ') WHERE id=1
SELECT @LastName=val FROM dbo.Split(@Person,' ') WHERE id=2
Now, I've ran into a couple issues where the name in the Excel spreadsheet is "John A. Doe". If you have a middle initial, it will still split it out, but you'll have 3 records instead of two.
Perform a count on the resulting split to determine the best approach for your data.
For my needs, I only added the middle name variable to the first name variable and save the FirstName to the table.
What about the Phone Numbers?
The phone numbers required additional massaging to get the format we need.
The client's database requires all numbers in the field. No dashes, periods, or parentheses.
We need to remove any extraneous characters and perform the same split.
-- (In Excel spreadsheet, possible phone values are '(xxx) xxx-xxxx',
-- 'xxx-xxx-xxxx', 'xxx.xxx.xxxx', '(xxx)xxx-xxxx Ext.99'...or anything else). DECLARE @PhoneSplitter VARCHAR(30);
--Test SELECT @PhoneSplitter='999.999.9999'; -- Remove the area code parentheses '()' SELECT @PhoneSplitter = REPLACE(@PhoneSplitter,')',' '); SELECT @PhoneSplitter = REPLACE(@PhoneSplitter,'(',' '); -- Remove the periods SELECT @PhoneSplitter = REPLACE(@PhoneSplitter,'.',' '); -- Remove the dashes SELECT @PhoneSplitter = REPLACE(@PhoneSplitter,'-',' '); -- Remove extra spaces SELECT @PhoneSplitter = REPLACE(@PhoneSplitter,' ',' '); SELECT * FROM dbo.Split(@PhoneSplitter,' ');
Your output will consist of the following:
Concatenate, cast as an integer, and save them to your table.
Again, you may have a phone number which includes an extension ("ext. 2044" or "x64") meaning you'll have 4 records instead of 3.
Adjust as necessary for your data needs.
I find it ironic working with databases where I have problems manipulating data to my needs.
Maybe it's because I'm not a DBA or maybe I've been spoiled by C#'s string manipulation methods? ;-)
This particular function is one of my tried-and-true functions which I use every time I'm in SQL Server when I have to massage or manipulate data.
In today's post, I demonstrated how to use XML to perform string manipulation and how it's meant to be a starting point for full-stack developers to discover additional ways in writing their own database "functions" to speed up the unusual ways of data manipulation.
Was there a better way to perform this? Do you have an existing library of database functions? Post your comments below and let's discuss.