Removing duplicate data
Database duplicate rows and data are considered one of the biggest threats that face SQL developers. It may also affect the behaviors of applications that rely on that database the contains those duplicate rows, by providing mistaken data.
A SQL server database designer could solve this problem easily by using UNIQUE constrains or IDENTITY keys, that for sure will prevent adding records the may break any of those keys or constrains, but most of the times or let’s say sometimes, data could be obtained from different resources (for example a text file) where data could be uncertain,miss-typed, and/or not following rules and instructions, these sources may have no abilities to remove those duplicates, especially when large amounts of data are involved.
The solution is to import those data to SQL server as is using SQL Server Integrations Services, then check if it contain any duplicates and finally, process those duplicates (move or delete). The following is a practical example of how to deal remove duplicate records using a cursor:
The input Data:
ID | FirstName | LastName | BirthDate | HiredDate |
| 1 | Ahmad | Ali | 1/1/1986 | 1/1/2005 |
| 1 | Ahmad | Ali | 1/1/1986 | 1/1/2005 |
| 1 | Ahmad | Ali | 1/1/1986 | 1/1/2005 |
| 2 | Mohammad | Ameen | 2/2/1980 | 1/1/2004 |
| 2 | Mohammad | Ameen | 2/2/1980 | 1/1/2004 |
| 3 | Bashar | Kokash | 1/1/1986 | 3/5/2008 |
| 4 | Asmaa | Saeed | 1/1/1987 | 1/1/2007 |
| 4 | Asmaa | Saeed | 1/1/1987 | 1/1/2007 |
| 5 | Morhaf | Mosa | 1/1/1983 | 1/1/2009 |
The T-SQL procedure:
CREATE PROCEDURE [dbo].[DELETEDUPLICATE]
AS
BEGIN
DECLARE @FIRSTNAME VARCHAR(50) ,
@LASTNAME VARCHAR(50),
@COUNT INT,
@BIRTHDATE DATETIME,
@HIREDDATE DATETIME,
@DELETEDROWS INT
SET @DELETEDROWS = 0
DECLARE C_EMPLOYEE CURSOR FOR
SELECT FIRSTNAME,LASTNAME, BIRTHDATE,HIREDDATE,COUNT(*)
FROM EMPLOYEE
GROUP BY FIRSTNAME,LASTNAME, BIRTHDATE,HIREDDATE
HAVING COUNT(*)>1 –This select statement gets only duplicate rows
OPEN C_EMPLOYEE
FETCH NEXT FROM C_EMPLOYEE INTO
@FIRSTNAME ,@LASTNAME ,@BIRTHDATE ,@HIREDDATE ,@COUNT
WHILE @@FETCH_STATUS =0
BEGIN
DELETE TOP (@COUNT-1) – if there is 5 copies of a record then delete 4 and leave one
FROM EMPLOYEE
WHERE FIRSTNAME = @FIRSTNAME
AND LASTNAME = @LASTNAME
AND BIRTHDATE = @BIRTHDATE
AND HIREDDATE = @HIREDDATE
SET @DELETEDROWS = @DELETEDROWS + @@ROWCOUNT
– gets the number of rows deleted
FETCH NEXT FROM C_EMPLOYEE INTO
@FIRSTNAME ,@LASTNAME ,@BIRTHDATE ,@HIREDDATE ,@COUNT
END
CLOSE C_EMPLOYEE
DEALLOCATE C_EMPLOYEE
RETURN @DELETEDROWS
END
The output data:
| ID | FirstName | LastName | BirthDate | HiredDate |
| 1 | Ahmad | Ali | 1/1/1986 | 1/1/2005 |
| 2 | Mohammad | Ameen | 2/2/1980 | 1/1/2004 |
| 3 | Bashar | Kokash | 1/1/1986 | 3/5/2008 |
| 4 | Asmaa | Saeed | 1/1/1987 | 1/1/2007 |
| 5 | Morhaf | Mosa | 1/1/1983 | 1/1/2009 |
The Result:
Now as a result we have a new unique copy of the same duplicate data with a little effort, that may save time and cost if duplicates were not processed