Been Thinking About: Reference Data
Reference data is a funny thing. In reality, reference data really serves no major purpose other than for display purposes, whether that be a report or in the UI. Yet we often have reference data in our backend databases, with relationships established for drill-down purposes. In reports, a report proc swaps out a reference key for a name value that's more user friendly (after all, users don't care about keys, especially if they are artificial). In the UI, reference data may be bound to a drop down list, and the selected key gets passed to the SelectedValue of the list, therefore ensuring that the list displays and restricts the reference data input.
The question then becomes, is this the most efficient way to handle this? Are there better ways of storing/representing data, and changing data over time? Even reference data changes at some point, as the standards for the business process changes.
An additional point of view also comes from the ORM of choice; each ORM treats relationships in a different way. I've seen displaying reference data handled different ways. In strongly-typed datasets, the reference data may be loaded in the UI by querying it directly. It may have a DataRelation setup or may not. Sometimes, custom datasets are created for select-only purposes that includes the pertinent reference data, which can be a hassle to maintain.
For LINQ to SQL, reference data is accessible through a property. A foreign key reference is represented by a collection using the EntitySet<> generic class, while a primary key relationship is represented as an EntityRef<> class, and as such, these properties are dynamically loaded. This means that LINQ serves up the primary/foreign data when it's accessed; this default behavior can be changed by specifying the LoadOptions property of the DataContext (One note, create and call all of the LoadWith<> or AssociateWith<> methods for that load options object before creating the DataModelDataContext, as the DataLoadOptions object that specifies the load options is tightly coupled to the data context instance). LINQ to Entities works similarly, but a little differently, in that it gives you more control over entity references.
And so reference data may be easily accessible, but it may still involve a performance hit for every record accessed (although it can be loaded at the same time as the parent record is loaded). And so a reference table structure needs plotted accordingly for performance reasons, as sometimes networking issues bog down the database (due to latency or other issues). To circumvent that, I've seen anything from using enumerations to reference the data (each key is assigned to an enumerated value) to using XML as an alternative reference data store, as disk IO can be cheaper than DB access. Of course, each solution needs weighed accordingly, and I'm not making one preference over another.
An XML-based solution can't be the entire solution, meaning the reference data still has to take up space on the server. This is because XML won't have the same benefit for reports in SSRS or Crystal, as it would for ASP.NET. Some process would need to ensure the data matches up, and that there is proper failover if the data doesn't succeed. An enumeration approach would require an application change for every data change, which could be a nightmare (proper versioning would help this).
Looking at reference data from another challenge can be due to varying data because of changing standards. For changing standards, a reference table may include the following data:
1 Accepted 1/1/1900 NULL
2 Rejected 1/1/1900 NULL
The field field is the key, the second the name of a status, the remaining are effective/end dates (null means open ended) for that data. In this scenario, at one point in time we have two values present, but in the future, we may have three different values, and now our table looks like:
1 Accepted 1/1/1900 1/1/2007
2 Rejected 1/1/1900 1/1/2007
3 Approved 1/1/2007 NULL
4 Denied 1/1/2007 NULL
5 Pending 1/1/2007 NULL
And now, we have two rewrites of existing values, and one new value Pending. For displaying purposes, using a date and a range would easily pull back the correct values; however, what if we want to check if the item is approved, or accepted (for past)? Do we hard code checks for 1 and 3 when checking for approved status? Do these values get stored in a separate DB table? An alternative idea to solve this problem may be due to a reference code. Imagine the following update:
1 Accepted AC 1/1/1900 1/1/2007
2 Rejected RJ 1/1/1900 1/1/2007
3 Approved AC 1/1/2007 NULL
4 Denied RJ 1/1/2007 NULL
5 Pending PN 1/1/2007 NULL
This third new value is a coded value that represents the (AC)cepted status, and even though it was rewritten, the code still points to that. We can now simply check the code to see if it's AC, and this value can be hard-coded into the application (as it's an arbitrary value to mean Accepted/Approved). This way, I can ensure that there will be an accepted reference value no matter how many times the name changes throughout the years (change is the biggest constant in application development).
I'm risking that this may make no inherant sense to anyone, but I hope to convey some ideas on how to handle changes to reference data, and concerns about reference data and latency/performance.