Bashar Kokash' Blog

.Net Framework, windows and web development.

This site

Resources

Friends

Sponsors

  • MaximumASP
  • Packet Sniffer

Removing duplicate data

Database duplicate rows and data are considered one of the biggest threats that face SQL developers. It may also affect the behaviors of applications that rely on that database the contains those duplicate rows, by providing mistaken data.

A SQL server database designer could solve this problem easily by using UNIQUE constrains or IDENTITY keys, that for sure will prevent adding records the may break any of those keys or constrains, but most of the times or let’s say sometimes, data could be obtained from different resources (for example a text file) where data could be uncertain,miss-typed, and/or not following rules and instructions, these sources may have no abilities to remove those duplicates, especially when large amounts of data are involved.

The solution is to import those data to SQL server as is using SQL Server Integrations Services, then check if it contain any duplicates and finally, process those duplicates (move or delete). The following is a practical example of how to  deal remove duplicate records using a cursor:

The input Data:

ID

FirstName

LastName

BirthDate

HiredDate

1 Ahmad Ali 1/1/1986 1/1/2005
1 Ahmad Ali 1/1/1986 1/1/2005
1 Ahmad Ali 1/1/1986 1/1/2005
2 Mohammad Ameen 2/2/1980 1/1/2004
2 Mohammad Ameen 2/2/1980 1/1/2004
3 Bashar Kokash 1/1/1986 3/5/2008
4 Asmaa Saeed 1/1/1987 1/1/2007
4 Asmaa Saeed 1/1/1987 1/1/2007
5 Morhaf Mosa 1/1/1983 1/1/2009

 

The T-SQL procedure:

CREATE PROCEDURE [dbo].[DELETEDUPLICATE]
AS
BEGIN

DECLARE @FIRSTNAME VARCHAR(50) ,
              @LASTNAME VARCHAR(50), 
              @COUNT INT,
              @BIRTHDATE DATETIME,
              @HIREDDATE DATETIME,
              @DELETEDROWS INT


SET        @DELETEDROWS = 0

DECLARE C_EMPLOYEE CURSOR FOR
    SELECT
FIRSTNAME,LASTNAME, BIRTHDATE,HIREDDATE,COUNT(*)
    FROM EMPLOYEE
    GROUP BY FIRSTNAME,LASTNAME, BIRTHDATE,HIREDDATE
    HAVING COUNT(*)>1 –This select statement gets only duplicate rows

OPEN C_EMPLOYEE

FETCH NEXT FROM C_EMPLOYEE INTO
    @FIRSTNAME ,@LASTNAME ,@BIRTHDATE ,@HIREDDATE ,@COUNT

WHILE @@FETCH_STATUS =0
BEGIN

DELETE TOP (@COUNT-1) – if there is 5 copies of a record then delete 4 and leave one
    FROM         EMPLOYEE
    WHERE        FIRSTNAME    = @FIRSTNAME
            AND LASTNAME    = @LASTNAME
            AND BIRTHDATE    = @BIRTHDATE
            AND HIREDDATE    = @HIREDDATE

SET @DELETEDROWS = @DELETEDROWS + @@ROWCOUNT
– gets the number of rows deleted


FETCH NEXT FROM C_EMPLOYEE INTO
    @FIRSTNAME ,@LASTNAME ,@BIRTHDATE ,@HIREDDATE ,@COUNT

END


CLOSE
C_EMPLOYEE
DEALLOCATE C_EMPLOYEE

RETURN @DELETEDROWS
END

The output data:

ID FirstName LastName BirthDate HiredDate
1 Ahmad Ali 1/1/1986 1/1/2005
2 Mohammad Ameen 2/2/1980 1/1/2004
3 Bashar Kokash 1/1/1986 3/5/2008
4 Asmaa Saeed 1/1/1987 1/1/2007
5 Morhaf Mosa 1/1/1983 1/1/2009

 

The Result:

Now as a result we have a new unique copy of the same duplicate data with a little effort, that may save time and cost if duplicates were not processed

Comments

No Comments