Introduction
In this article we will take a look at how SQL Server performs with one million records in a table. We will create a quick data pumper program to fill up a table with a million dynamically created rows of data. From there we will take a look at querying the data without any special optimizations. Then we will create a Full-text index and see how that helps our searching capability. And then we will take a look at some of the added searching features that having a full-text index provides us with.
Following this article we will take a look at Lucene.NET to see how it compares to SQL Server full text search. Then we will take a look at how we can use Lucene.NET from within the context of SQL Server.
Creating the test data
In order to start comparing the performance and toolsets of both SQL Server Full-text and Lucene.NET we will first need to create a set of data to work with. This data should start off the same for both environments to make the comparison fair. In order to create this set of data we will first start by creating a console application that will use LINQ to SQL to pump a million records into a posts table in SQL Server.
The data pumper will use another table called paragraphs that will hold a few hundred or so paragraphs from which we can piece together full bodied posts to create our fake data. We will then create another table called tags to create some tag associations. We will shoot for 8 to 10 paragraphs for each post and will associate 0 to 5 tags per each post. We will only associate tags that actually exist in the body of the post that is created.
Creating the tables
Before we get started creating our data pumper we will need to first create a series of tables and some base data from which to generate our test data. The first table we will create is the posts table. This table will be the final destination for our test data. The posts table will hold the title of the post, the body of the post, and the associated tags for a given post.
Listing 1: Posts
The next table we will create is the paragraphs table. This table will hold a bunch of random paragraphs either from the "lorem ipsum" generator (www.lipsum.com) or selected from various .net oriented articles here on dotnetslackers.com. This table will be used to generate posts later on with the data pumper that we will create.
Listing 2: Paragraphs
The last table we will create will be the tags table. This table will hold a collection of keywords used for creating a tag cloud. These tags will be randomly associated with the posts that we create as we would when building actual posts for a blogging or forum system.
Listing 3: Tags
Creating the initial data
Now that we have the tables that we need to hold our generated posts we need to quickly build up a library of base data. We will start by going out to www.lipsum.com to generate 50 paragraphs. These paragraphs will go into the paragraphs table for randomly creating the posts later. Then we will need to go out to scavenge some paragraphs from articles on www.dotnetslackers.com. I am going to gather a few hundred of those.
Then we need to gather a selection of tags to populate our tags table. We can come up with a few of these on our own. If you need to come up with a few more you can go to any technology site to get a good list created.
Setting up the data pumper project
Next we will create our data pumper project. The project type will be a console application. This project will be used to create one million generated posts in our posts table. It will use a selection of the paragraphs and tags that we put into our initial data tables. We will loop one million times. With each iteration we will generate a post of 8 to 10 paragraphs long. Next we will run a test to see if any of the tags that we have in our initial tag list exist in the post. If we find a tag association then we will put the found tag into the post table along with the post.
Setting up LINQ to SQL
In the console application for our data pumper we will add a Linq to SQL data context (I named mine Sql which will result in a SqlDataContext class being generated for us). Then we will add a new connection to our data connections. And then we will drag our three tables from our new connection onto the LINQ to SQL design surface.
Writing the code to generate the data
Although our goal is to be able to generate one million records we will initially write a program that creates just ten complete posts (to test that our theory/code is sound).
The idea behind the code is to have a loop that iterates X amount of times. With each iteration we will generate a post with 10 paragraphs. Then we will analyze the generated post to determine which tags should be applied to that post. Next we will add the new post to a list of posts that we will maintain until we have generated enough posts to save to the database.
You will have to play with the appropriate number or posts to generate before saving them to the database. A small list will be easy to create but may take up to many resources when opening a connection to the database over and over again. Creating too large of a list may result in running out of memory prior to saving it to the database. Creating a medium list that is still fairly large may result in clogging the pipe to the database which may take too long to actually get the saving of the data completed. I ended up creating a list of 10,000 posts prior to saving them to the database (which seemed to work well on my system).
Data pumper code
Since this is a simple program we will work directly inside the Main method that is generated for us in the console program. The first thing we will do is to get the full list of paragraphs and tags in our initial lists of data. We will use those when generating our posts.
Next we need to create two functions. The first function CreatePostString is to randomly generate a post from the list of paragraphs. The other function is to scan the post to find out which tags are appropriate for that post.
Listing 4: CreatePostString
Listing 5: CreateTagString
With these two functions out of the way we can then look at the main body of code in the Main method. We are basically iterating for the number of posts that we want to create. We get the generated post from the function call. Then we get the generated tag set from the second function call. With that data in hand we can then create an actual Post object.
Next, rather than saving every post to the database we check the counter and save a list of Posts with ever 10,000 records created.
With all the code created all you have left to do is to run it. Don't expect it to run quickly though if you are generating 1,000,000 records. Also be aware that you may have errors along the way such as running out of memory for a large set of data or running out of file space in your sql server files.

Once I had generated 520,000 records I found that I had filled my quota with regards to my local SQL express installation - 4gb max database size! As I am more interested in hitting my one million row count than I am the number of paragraphs to search in each post I am going to shrink my post size down to 5 paragraphs instead of 10. And after a fair amount of time, creating 100,000 posts prior to each save to the database I finally processed all 1,000,000 records! This resulted in a database that is just shy of the 4gb limit.

With all of these records in place I then did a simple Count(*) style query to see how many records actually made it into the database. This is a simple example of what dealing with a million records is like. As you can see in the image it took nearly 10 minutes for the database to determine how many records were created.

Searching the data without an index
Next we will run a more realistic query where we look in the posts column for any records with "SqlMetal" in the text and where the tags column has an "asp.net" entry. The query will look like this:
The above query took 3 minutes and 5 seconds to run and returned 26,999 records.
Creating a Full-text index
Now that we have seen how slow searching a million records can be we will enable Full-text searching and create the required indexes to perform the same search with quicker results. We will also take a look at what it takes to get this set up.
First you need to make sure that you have installed full-text search capabilities with your sql server engine. Then we will need to enable full-text search with this snippet of sql.
Next, we need to create an actual index for the table that we are going to search. Be aware that in order to create the index on a table we must have a single column in the table that has a unique non-nullable index which would be the primary key that we have in our posts table. Also be aware that the size of your primary key column will be stored with your index. This is important to know as if your primary key is built around a GUID your index will swell up quite a bit. In this case you might consider creating an int based primary key which will be considerably smaller than storing a GUID in the index.
In the sql below we are creating a fulltext index on the posts table around the body column. We are using the PK_Posts index which is the primary key field PostID on the catalog that we created in the previous sql. Here is that code.
Once you have run the above sql you can navigate to the properties window for your newly created index. Do this by right clicking on the table, selecting modify or design, then right click on the arrow next to your column and choose full-text index. This will show a properties window for the index we just created. In this window you can check the status of your new index. The thing you are interested in is that Columns property is set to the Body column as specified in our sql. We also need to make sure that the Active property is set to Yes. And then you need to look at the Crawl Completed section. If you are looking at it right after creating the index then the odds are that you will see that the crawl is not completed. It takes a while to create these indexes!

Searching the data with a Full-text index
Now that we have the full text index created we can take a look at some query metrics. First we will run a simple query to see how long it takes. Then we will run the modified query to see what sort of improvements we have.
This query took 3 minutes and 22 seconds the first time it was run and returned 89,684 records. Each time it was run after that it took 3 minutes and 7 seconds. Not a huge improvement. Now we can modify this query so that it takes advantage of our new full text index. The query looks like this.
Running this query initially took 32 seconds which in and of itself is a huge improvement. Running it each time thereafter it took 3 seconds each time which is an even larger improvement! Now let's add the additional query that we started with where we look for all posts with an asp.net tag attached to it.
Running this query with the added where clause ended up taking just 29 seconds! The next time the query is run it takes just 1 second. Query caching at its best.
Now for the power of the full text search. Using contains searches the text for the specific match of the search. We can also use a search term called FreeText that will look for all sorts of inflections of the given search query. To modify our search to look for all instances of "use" such as used, user, use, etc. we can run a query like this.
This query returned "used the", "uses extensions", and "use that" for example!
Summary
In this article we created a data pumper to dynamically generate one million posts with appropriate tag sets. We then looked at what querying one million records performed like. Next we looked at what it took to create a full text index. Then we queried the same data with a similar query that utilized the full text index to see what the performance differences were.
In the next article we will take look at what it takes to create an index for Lucene.NET. Then we will execute some queries against the Lucene.NET index and see what sort of difference we get in the results between the a Lucene search and a full text search.
About Andrew Siemer
 |
I am a 33 year old, ex-Army Ranger, father of 6, geeky software engineer that loves to code, teach, and write. In my spare time (ha!) I like playing with my 6 kids, horses, and various other animals.
This author has published 29 articles on DotNetSlackers. View other articles or the complete profile here.
|
You might also be interested in the following related blog posts
Lucene.net: the main concepts
read more
Script to Inventory Print Servers
read more
MOSS 2007: Index and Web-Front-End Server Topology
read more
Upload Images To SQL Server By Way Of An ASP.NET Web Form
read more
CS Dev Guide: TextParts
read more
Handling "GO" Separators in SQL Scripts - the easy way
read more
Determine whether a SQL Server exists ..
read more
Bak2Basics: Know thy indexes
read more
SQL/e - Microsoft to release an Embedded Database version of SQL Server
read more
SQL Server FAQ : How to join tables from different servers
read more
|
|
Please login to rate or to leave a comment.