ETL - That's Wrong, Don't Delete

June 12, 2020
sql etc dev

You have an ETL process (importing data) that deletes and inserts data. This is what I do to make sure it's not a mistake to keep going.

The Obigatory Preface

There's more than one way to skin a cat. Why would nayone skin a cat is beyond me. It's in poor taste (no pun intended). It's inhumane and it's just odd. Did someone wear a cat on their head, like Davie Crocket? Skin a beaver makes more sense. Anyway...

This could be done a few different ways and I do this, in my all code bases, a few different way. This is just one that, in this instance, performs the best. By the best I mean - in and out.

OK - meta & potatoes.

You bring in some data and it's 156K rows. You cleaned it up, you parsed it, yo massaged it, you adjusted it to satisfy the bribe you took to make it look different (that's a joke; kinda, does nayone believe this never happends? Really, it's just some guy, like me, that controls your data...). You're good to go. Let's replace what we have.

In this example, it typically goes like this:

DELETE [Table] WHERE IsTrue = 1

INSERT INTO [Table] SELECT * FROM [OtherTable]

Now, weather you have 10 more criterions or ensure that you only delete things that make sense, what you dodn;t check for is VOLUME.

Let's say you make sure that you only delete data for the same date:

DELETE [Table] WHERE IsTrue = 1 AND ThisDate IN (SELECT ThatDate FROM [OtherTable])

INSERT INTO [Table] SELECT * FROM [OtherTable]

That's kewl. Now you only delete data for the same date you're importing. This can be for any number of fields and different scenarious. We can avoid the yeah buts..this is a specific scenario I am dealing with. Keep reading, Internet Hitchhiker

What is you're import only hasone (1) row for that date because the ETL process was flawed, or the source was missing data? The DELETE still shows true, but you deleted 110K rows and replaced it with 1 - not good. no bueno. Malo. Muy malo. Unless it's supposed to be that way. Then you can move on.

This is how I handle this scenario when something like that SHOULD NOT happen - by design.

DECLARE @newCount INT;
DECLARE @existingCount INT; --if you deal with much data use BIGINT. cause 2B is peanuts.

SELECT @newCount = COUNT(*) FROM [OtherTable];
SELECT @existingCount = COUNT(*) FROM [Table];
DECLARE @percentNewOfOld float = CAST(newCount  AS float)/CAST(existingCount  AS float);

Now, there's a few things you can do here. You can check counts, percents, etc. This is one example.

IF(@percentNewOfOld < .75) -- I'll take a 25% reduction in volume - maybe you need more than before?
BEGIN
  --we have a problem - now you can do what you do when you have a problem
--LOG IT..
--RAISEERROR(...
--GOTO: Fucked...
END

And that's it. 

REVERSE-TL;DR

You wanted to replace some data, but before you, you want to make sure that you're replacing that within an acceptable range and if the range is not met, abort. Maybe you expect for volumes to change by x percent (number of appointments actually worked on vs. forecasted) or maybe you're number can only go up (number of people that call in on a given day). Either way, but cognizant of this and Code Defensively!