SQL Bulk Copy - how I love thee?

June 24, 2022
sql sqlbulkcopy toolshed errors

A classic but sometimes-PITA ETL class that is essential. A couple of errors that are annoying but easy to identify and fix.

I have been using SqlBulkCopy (SBC) for way longer than I wish to admit. It's amazing that it is still the de facto way to import huge amounts of data into a SQL Server. It just is.

Way back when, and again recently, I used it to import flat files. At the time I was working inside console apps that ran as services on app servers. We also did some importing directly into SQL Server, running it inside a proc. That was trickier, though maybe faster, because the files had to be on the SQL Server box or reachable via a mapped drive. Neither was great for me. I like to keep the data away from everything else. Let SQL Server do what SQL Server does best.

I don't want to wax on about my history with SBC. The truth is, it's great, but it's a fastidious little bitch. Actually, it's picky, exacting, and unforgiving, and all of those can be prefixed with "unreasonably."

There is no reason the case of a column name should matter to SBC when SQL Server (under my collation) is not case sensitive. Or why, in some cases, the order of columns seems to break things, and why the errors returned are as cryptic as all the riddles and side plots of a Dan Brown novel.

But we have to use it. Because the alternative, inserting mucho rows using INSERT statements, is not feasible. You can try, but 500K inserts is going to make any SQL Server unhappy.

Here is some anecdotal experience, advice, and a few utilities for importing large datasets, specifically when using SBC.
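If you've never touched SBC, the basic pattern is just a connection, a destination table name, and a call to WriteToServer. Here's a minimal sketch; the connection string, table, and columns are placeholders, and it assumes the Microsoft.Data.SqlClient package:

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

// Minimal SqlBulkCopy sketch. The connection string, table name, and
// columns are placeholders - adjust to your environment.
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Name", typeof(string));
table.Rows.Add(1, "Alice");
table.Rows.Add(2, "Bob");

using var connection = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true;");
connection.Open();

using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.MyDestinationTable",
    BatchSize = 5000 // tune for your workload
};

// One round trip per batch instead of one INSERT per row.
bulkCopy.WriteToServer(table);
```

Every error below happens somewhere inside that WriteToServer call.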

Error #1 - truncated string, colid x, or Error 4815 (Azure)

This is probably the most common error you'll encounter. It happens when one of your source data strings is longer than its destination column. This is especially common with new imports because you're unfamiliar with the data and some errant column happens to have a value of "YES" instead of "Y." It's notoriously hard to figure out which column is the problem because the colid in the error message is typically wrong - or no one knows where it starts counting. 4815 is the same error from Azure SQL. While you should know the lengths of your data columns, let's be honest, we don't. And a lot of source systems have shit like VARCHAR(MAX) for columns that are 10 characters long. (Side note: I have seen some errors lately where it tells you which column is problematic. But only sometimes. I am not sure when it happens, but it's nice.)

What to do

You can always set every column to varchar(4000) or nvarchar(4000), or (max) if you be crazy. But if you plan on using the column in an index, that's not doable - index keys have size limits. Also, I am sure there's some really bad SQL engine stuff going on if you decide to make every non-indexed column 500 characters. So let's, in the face of ignorance, assume you cannot and do not want to do this and would like a column that truly represents your data. Here are some steps that can help, in no particular order.

Find the max length

There are two ways to do this depending on your environment. 

First way: make the table varchar(4000) or (max). Then insert all your data. Then run SELECT MAX(LEN([column name])) on the column you care about. Oh, every column? Use this gist to get the max in each column. That query reports back the max size based on the data in your table. It's pretty sweet.
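If you can't find a gist like that, the idea is easy to sketch: pull the string columns from INFORMATION_SCHEMA and build one MAX(LEN(...)) per column. Something like this - the table name and connection string are placeholders, and it assumes Microsoft.Data.SqlClient:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.SqlClient;

// Sketch: report the max data length of every string column in a table.
// Table name and connection string are placeholders.
const string tableName = "MyStagingTable";
using var connection = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true;");
connection.Open();

// Step 1: find the string columns.
var columns = new List<string>();
using (var cmd = new SqlCommand(
    @"SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS
      WHERE TABLE_NAME = @table
      AND DATA_TYPE IN ('varchar', 'nvarchar', 'char', 'nchar')",
    connection))
{
    cmd.Parameters.AddWithValue("@table", tableName);
    using var reader = cmd.ExecuteReader();
    while (reader.Read())
    {
        columns.Add(reader.GetString(0));
    }
}

// Step 2: one MAX(LEN(...)) per column, e.g. MAX(LEN([Name])) AS [Name].
var select = string.Join(", ", columns.Select(c => $"MAX(LEN([{c}])) AS [{c}]"));
using var maxCmd = new SqlCommand($"SELECT {select} FROM [{tableName}]", connection);
using var maxReader = maxCmd.ExecuteReader();
if (maxReader.Read())
{
    for (var i = 0; i < maxReader.FieldCount; i++)
    {
        Console.WriteLine($"{maxReader.GetName(i)}: {maxReader[i]}");
    }
}
```

Note LEN gives characters, not bytes; swap in DATALENGTH if you care about nvarchar storage.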

Second way: I have since decided not to load up millions of rows just to get the max sizes, for various reasons. Now I process the data and just log (to screen or wherever) the max length of each column. For example, in [toolshed] I have this precious little thing:

public static void LogToConsoleStringLengths<T>(List<T> data, bool stopWhenDone = true, Type[] ignoreColumnAttributes = null)
{
    foreach (var item in typeof(T).GetProperties())
    {
        // Skip properties decorated with any of the ignored attributes.
        if (ignoreColumnAttributes != null)
        {
            var d1 = item.CustomAttributes;
            if (d1.Any(x => ignoreColumnAttributes.Contains(x.AttributeType)))
            {
                continue;
            }
        }

        if (item.PropertyType == typeof(string))
        {
            // Track the longest value seen for this string property.
            var longest = 0;
            for (int i = 0; i < data.Count; i++)
            {
                var l = (string)item.GetValue(data[i]);
                if (l != null)
                {
                    longest = longest > l.Length ? longest : l.Length;
                }
            }

            Console.WriteLine($"{item.Name}: {longest}");
        }
    }

    if (stopWhenDone)
    {
        Console.WriteLine("Done - press Enter to continue");
        Console.ReadLine();
    }
}

It will go over the provided list and log the longest string for each string property. This is a super easy way to track what could be longer than what your destination allows. In some future iteration, I'd like to query the DB for the actual max lengths and just highlight the culprit. But, another day.

Error #2 - Column not in source or destination

This is a combo, since you're not really sure where the bug slipped in. Was it the source or the destination - why won't you tell me! But ultimately, it's bad somewhere.

Let's go for the easy one first - destination. Whether you can do this will depend on your code, of course. My imports are typically represented by a class. I use NPoco as my micro ORM. To ensure that my destination table has the columns I expect it to, I just select from it. And since NPoco does the hard work, this is easy.

_ = await NpocoDbVariable.FetchAsync<MyDestinationClass>("WHERE 1 = 0");

If I have a column wrong or not in the database, this will throw an exception. This is one of those errors that usually sneaks up on me while I am designing a new table. Especially a summary table where I am deciding what to add and what to name things.

Now, you might get mad at me. Since I explicitly add my columns to SBC, I have yet to have a source issue. That would rear its head earlier in the process. So I don't have a solution for that :) except - make sure your columns match your destination class. If you have different names for both and then you tie them together while mapping them… well, you are brave. But still, this shouldn't be an issue.
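For what it's worth, "explicitly add columns" can be driven off the destination class itself, so the mappings can't drift from the properties. A sketch, where MyDestinationClass, the table, and the connection string are placeholders:

```csharp
using Microsoft.Data.SqlClient;

// Sketch: one column mapping per public property of the destination class,
// so SBC never has to guess and source/destination names stay in sync.
// MyDestinationClass and the table name are placeholders.
using var connection = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true;");
connection.Open();

using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.MyDestinationTable"
};

foreach (var prop in typeof(MyDestinationClass).GetProperties())
{
    // Source name on the left, destination column on the right.
    bulkCopy.ColumnMappings.Add(prop.Name, prop.Name);
}
```

If a property doesn't exist in the table, the mapping is what blows up - which at least points you at the right name.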

Error #3 - Types

This one happens less often but can show up. It's when you try to shove a date into an int column. I use a version of the above to fix it. Inserting a new record using NPoco usually identifies it. Or just a stare and compare. Not exciting, I know.
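The stare and compare can be semi-automated. A rough sketch that checks CLR property types against INFORMATION_SCHEMA - the type map is deliberately tiny, and the class, table, and connection names are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.SqlClient;

// Rough sketch: flag properties whose CLR type doesn't line up with the
// destination column's SQL type. The type map is deliberately small;
// MyDestinationClass, the table, and the connection string are placeholders.
var clrToSql = new Dictionary<Type, string[]>
{
    [typeof(int)] = new[] { "int" },
    [typeof(long)] = new[] { "bigint" },
    [typeof(string)] = new[] { "varchar", "nvarchar", "char", "nchar" },
    [typeof(DateTime)] = new[] { "datetime", "datetime2", "date" },
    [typeof(decimal)] = new[] { "decimal", "numeric", "money" },
};

using var connection = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true;");
connection.Open();

// Pull the destination's column types.
using var cmd = new SqlCommand(
    "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = @table",
    connection);
cmd.Parameters.AddWithValue("@table", "MyDestinationTable");

var sqlTypes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
    {
        sqlTypes[reader.GetString(0)] = reader.GetString(1);
    }
}

// Compare each property's CLR type against the column's SQL type.
foreach (var prop in typeof(MyDestinationClass).GetProperties())
{
    var clrType = Nullable.GetUnderlyingType(prop.PropertyType) ?? prop.PropertyType;
    if (sqlTypes.TryGetValue(prop.Name, out var sqlType)
        && clrToSql.TryGetValue(clrType, out var allowed)
        && !allowed.Contains(sqlType, StringComparer.OrdinalIgnoreCase))
    {
        Console.WriteLine($"Mismatch on {prop.Name}: CLR {clrType.Name} vs SQL {sqlType}");
    }
}
```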

And that's it.

P.S.: Toolshed

While we're on this, let me share a little helper library I have been working on for years. Seriously. I used to copy and paste the DLL into projects before NuGet existed. At one time I also referenced it from all the projects. Debugging was nice back then, though. It's now on GitHub with a nice, easy-to-use NuGet package. I add it to every project. It has a metric ton of helpers. Specifically for my life, but you might find it useful. Feel free to use it or copypasta the code.

I have quite a few helpers around SqlBulkCopy (and other stuff). Check it out: toolshed/SqlHelper.cs - toolshed (github.com)