Search This Blog

August 23, 2010

Data Infrastructure Architecture [2/2]

Yesterday I wrote a fairly long blog post which started to delve into some of the questions addressed by data infrastructure architecture.  After writing more than 2000 words I decided it was time to give you all a break and for me to have some time away from the keyboard.

So here’s the second instalment in which I’ll outline the remaining concerns for the data infrastructure components of data architecture.

Data platform performance management

One of the key success factors for a stable DBMS is that it can give business application users the data they need in a timely manner.  Making sure that the data platform has the system bandwidth required to address the workload is then very important, and managing this requires the implementation of supporting infrastructure and process, which must provide ways of answering a number of questions.

Is the server sized to cope with the existing workload?  What components of the physical server environment are under stress?  What are the system bottlenecks that are causing performance stress?  What sort of query latency are users having to endure?  Is the server running into deadlocks or managing huge blocking trees?  Is server stress a cyclic or ongoing phenomenon?  Is the storage layout of databases appropriate for the workload?  What is the business impact of data platform stress?  Is data platform performance meeting service level agreements?  What (if any) are the penalties for service level breaches?  How do you balance the cost of service penalties against the cost of prevention?  If database queries have to be tuned, who is responsible for this?  Who manages the engagement of third party ISVs whose queries are running slowly?  Do they have SLAs to comply with or service penalties against them?  How is all this being measured?  Are these measurements sampled from real-time data-points or are you sampling averages?  Do you have a performance data warehouse in place?  Can you detect trends in performance over time?  Can trends be linked to any specific configuration changes or business changes?

Not bad for a single stream of consciousness… even my head’s spinning.  Suffice it to say that having some structure about how performance is measured and managed is a key component of selling data platform infrastructure architecture as a service to your clients.

Data platform security & access control

Data platform security design ultimately answers two questions.  Who gets to access your data?  And under what circumstances do they get to enjoy that privilege?  Of course, it’s not that simple.  There are a number of far more detailed questions that need to be answered, but who/what and where/when/how are the items that business people understand.  That is: Who is able to query the database?  What data can each person access and what can they do to it?  Are there restrictions on where the user can access their data (e.g. home/work/mobile access scenarios), when they can access it (e.g. limitations on end-user access during batch processing or database maintenance windows) or how they access it (e.g. from named applications vs. end-user tools such as MS Access or Excel)?

At a more technical level we can talk about attack surfaces, defensive coding mechanisms, authentication, identity maintenance and management, exploit management, application-level query validation, roles, schemas, encryption, directory integration, firewalls, DMZs, code/query execution policies, identity impersonation, SQL injection, exception logging and auditing, patch management and more.  Ultimately these technical terms all address the fundamentals of who/what and where/when/how.

Data platform backup & recovery

Once you have a bunch of people happily transacting away in your database, they’re going to have some expectations about that data still being there when the come in to work tomorrow.  I know… just plain unreasonable, isn’t it!  Sadly, IT professionals just have to get over the fact that business people actually want to use the applications we IT people build for them, which means that data still being around the next data matters.  To this end, data platforms need to be backed up on a regular basis as data loss can and does occur.  Backup planning requires that you have the capacity to store at least some of your backups for a long period of time, that the data in those backups is not compromised as a result of page-level corruptions, and that you understand the full path that your data travels between your database and its final resting place – typically in a third party tape storage centre.  You need to make sure that if your backup medium changes that old backups are migrated to that medium.  You also need to make sure that media storage of backups and its transport is secured through the entire journey.  You need to define your roles and responsibilities around backup management.  Are your DBAs responsible for changing tapes?  Or is this something that gets managed by your server team?  What about archive management tools?  Are you going to use SQL Server specific backup agents?  Ore will you back up to disk then backup the files to tape?  Compression and encryption are additional factors in your choice of backup solution design.

However, it’s not just enough to do backups on a regular basis.  I have learnt from bitter experience that you need to be able to verify that your backup solution can be restored as well.  What are the processes for getting your backup media back from offsite storage?  What does that process cost?  Who is responsible for mounting the tapes for restores?  What impact does restoring data have on ongoing operational processes?  Do you need to do your restores over the top of existing data, or even to a copy of the database on the same server?  Is there enough disk space to support the restore?  What is the objective of the restore?  Are you trying to get at data that went missing in one table?  Or is the whole of the original database compromised?  What is the maximum acceptable data loss in your service level agreement?  What is your SLA for maximum acceptable time to system restoration?  How is the decision to perform a restore made?  Who makes it?  What are acceptable grounds for database restores?

Backup and Recovery Services (BURS) is a complex area of technology where the broad sweeping designs aren’t what will cause you pain.  It’s all the niggly little details that hurt you, and it’s important to ensure that you’ve got the right heads focused on getting them right.

Data platform license management

This is kind of an odd one… it sort of belongs under data platform selection but it also has implications for support and for total solution costing.  The choice of licensing models is an important one, and has much broader reach than you might think.

How do we go about selecting a licensing model then?  Well… lets start with features.  Different editions of your data platform will have different features.  If there’s a killer feature that you MUST have, or pays for the price difference between editions then you should make the business case and request the more capable edition.  If the feature is not going to provide sufficient ROI, then you need to make sure the project understands the additional costs that will be incurred as a result for delivering the same required results.

Another important consideration is existing agreements.  For example, if your business has a Microsoft Enterprise Agreement that gives everyone SQL Server CALs as part of their bundle, then you can license SQL Server in “Per Server” mode rather than “Per CPU”.  This is usually much cheaper, but has some limitations – namely, that you cannot then use that server to store data accessible from the internet unless you also buy an internet data collector license. 

Do you need multiple editions of MS SQL Server on the same physical host?  If so, you may find that some editions require a separate license for each DBMS instance, while others might cover any number of instances up to some named maximum.  Carefully weigh up the costs per instance to make sure that you have the right level of coverage.

You also need to consider whether you buy the licenses up-front, or whether you buy your software through a vendor who can offer you MPLS monthly charges.  Either way, this choice will have corresponding cash-flow impacts on the business.  If you’re working for a cashed up enterprise then the monthly charge might be more attractive as they already have sufficient savings, debt-worthiness or cash-flow to be able to sustain monthly licensing.  However, if you are working for a start-up, they may elect to go for the up-front single payment and treat it as an investment in not having to deal with a recurrent cash-flow impact.

The final licensing factor which is often ignored or poorly understood is software assurance and support.  In the case of SQL Server, software assurance is reasonably expensive, but prevents a cost at a later date of having to pay to upgrade to a new version of the product.  TechNet and MSDN licenses should also be considered, as these give you access to development and support resources which might not otherwise be available – e.g. a certain number of Customer Support Services cases per year, or early access to pre-release editions of SQL Server for regression testing and review. 

Data platform application & integration interfaces

This final aspect of data platform infrastructure architecture is often neglected by DBAs, and is typically a good way to tell between a DBA and a data architect.  Every data platform has a set of supporting technologies that allow data consumers to access data from applications, or have data fed from one data source into another.  Putting corporate standards in place for how the supporting infrastructure for these activities hangs together is a critical success factor for any data platform design.  In the case of SQL Server there are a lot of different options on how to connect to databases, ranging from true legacy libraries such as Embedded-SQL and DB-Library, through to the latest and greatest SQL Server Native Client libraries, ADO.NET versions and the abstraction layers that can sit above them in the application tier – including ORMs such as Entity Spaces, LINQ2SQL, LLBGen and NHibernate, or more model-driven architectures such as Entity Framework.  Each of these solutions has a batch of client-side requirements that need to be installed on application servers (for thin client/N-Tier applications) and/or client desktops (for thick-client applications).  Data architects should be a driving force in selecting these technologies and provide guidance on how they should be used in applications.

Integration technologies are an even bigger can of worms.  You have SSIS and SQL Server native replication as immediately obvious options.  Native-ish “application” interfaces such as SQLCMD, PowerShell, Service Broker and Sync Services are also solid data integration paths for the SQL Server platform.  However, after you get past those you have literally hundreds of different ETL tools, data integration platforms, heterogeneous replication engines, client-side applications and messaging solutions that can all quite happily talk to SQL Server.  Working through the noise to standardize on a subset of harmonious and complementary technologies is not an easy task, but it should be considered a high priority for any data platform architect.

Of course the usual questions of roles and responsibility then enter the two problem spaces above, as does the ongoing maintenance of your technical policy choices.  Sometimes version m of tool X is better than version n of tool Y, but version m+1 of X is nowhere near as capable as version n+1 of Y.  Do you switch horses when this happens and hopes that X eventually catches back up in version m+2 or m+3?  If you don’t, then at what point do you switch to Y?  Straight away at n+1?  After ongoing market leadership has been established at n+2?  These are all important questions and they require not only an insight into the discrete capabilities of the technologies, but also the visions and directions of their respective manufacturers.

Well… there you have it… a quick (4100 word) overview of data platform infrastructure architecture.  By no means have I been exhaustive!  But I hope you’ve got something useful out of these entries.  In the next blog I’ll start talking about database infrastructure design in more detail.  Once I’ve finished talking about some of these architecture pieces at a high level I’ll come back and revisit the detail, wrapping some more structure around the questions that need to be answered and the design options available.  Until next time – code well, code often!

August 22, 2010

Data Infrastructure Architecture [1/2]

Yesterday I wrote a blog defining Data Architecture as a field of endeavour.  Today’s entries is the first of several that will drill down into the individual concerns that data architecture addresses.  I’m going to start with Database Infrastructure Architecture, as it probably has the fewest dependencies on other areas of data architecture.

Defining The Challenge

What is database infrastructure architecture?  The short answer is that it’s all about the platform, and not about the content.  Database infrastructure architecture addresses very concrete requirements around making sure that data can be accessed by the right people at the right time in a reliable manner within acceptable performance parameters.  To these ends, database infrastructure requires that the BATOG principles address the following functional requirements:

  • Data platform selection
  • Data platform host environment selection
  • Data platform capacity management
  • Data platform configuration management
  • Data platform support & problem management
  • Data platform performance management
  • Data platform security & access control
  • Data platform backup & recovery
  • Data platform license management
  • Data platform application & integration interfaces

As discussed, this blog is mainly going to show how we address these issues from a SQL Server perspective, so we’re going to discuss these in the context of Microsoft SQL Server 2008.

Obviously, each of the bullet points expands into a sizeable body of knowledge in its own right, so today I’m only going to set the context for later posts on each topic.

Data Platform Selection

Choosing a data platform is a foundational piece in any data-related solution design.  There are a number of different options for types of platforms, including “Document Databases”, marked up structured file-storage (e.g. XML), serialized binary object stores (AKA object databases). and relational database management systems – each with their own strengths and weaknesses.  However, over the last 15 years the IT industry has moved far more decisively towards relational databases.  This is because relational DBMS platforms tend to have a broader mix of strengths and fewer weaknesses than any of the alternatives, particularly with respect to data retrieval performance (thanks to generically functional indexing capabilities), data protection (backup and recovery) and system scalability.  In a society where the net amount of data being stored grows by 15-20% per annum, scalability has become a critical factor in platform choice.

As discussed, my focus within this blog is on Microsoft SQL Server, so for my purposes, SQL Server can be considered the “Default” data platform selection, but there are still decisions to make about the editions of SQL Server to install.  That choice will be driven to some degree by corporate standards, but there are other factors, such as the scalability required to meet the expected business workload, the number of SQL Server extended features required to manage and deliver the workload (e.g. Reporting Services, Integration Services, Analysis Services and other bolt-ons introduced in SQL Server 2008 R2).

Data Platform Host Environment Selection

Now that you’ve picked your database platform of choice, you need to start thinking about how it’s going to be hosted.  SQL Server only runs on Windows, but that still leaves you with a number of choices.  For example, are you going to build dedicated servers or run SQL Server within virtual machines to get higher utilization out of your hardware?  Are you going to opt for a console-only version of the Windows Server operating system or a fully GUI enabled one?  What about server operating system editions?  You have price and scalability factors that need to be considered, for example, the RAM and CPU scalability supported by each edition of Windows, and your choice of SQL Server edition can also influence your operating system choices.  There’s also questions about availability requirements – do you need to use failover clustering or some sort of equivalent virtualization technology to ensure high availability?  This is already a lot of questions, and we haven’t even started to talk about the difference between scale-up, scale-out and logical vs. physical virtualization!  [I’ll tackle those in future posts.]

Data Platform Capacity Management

Now that you’ve figured out the broad brush strokes on what you’re going to install, you’re going to have to figure out how much storage you’re going to need for your data.  Here’s some things to think about.  The most critical element of capacity for SQL Server is disk or SAN storage.  You need enough storage to hold all the data that is currently in the database, as well as for future requirements.  A starting point for this kind of estimation is to estimate the maximum expected number of rows in your tables and multiply that by the average width of data in those tables.  However, storage estimation is not as simple as it seems.

Each data page in SQL Server is 8KB in size.  If your rows are not exactly 8KB in size, this means there’s going to be quite a bit of dead space in your database.  This is not a bad thing – the consistent page size makes tuning I/O for SQL Server quite simple, but it means that you need to factor in the white-space in your data pages when sizing the number of pages required for a given quantity of rows in that table.  e.g. If you have data rows in TableX with an average of 6KB per row, and you need to store 1,000,000 rows – you’re not going to need 6KB x 1,000,000 = GB of data.  You’ll actually need 8KB x 1,000,000 = 8GB of data.  You also need to factor in index storage, LOB storage, full text index storage and XML storage in your space estimates.  You need to size not just for current data loads, but for future volumes – and if you’re going to say “We’ll move to SAN storage only when we really need to,” you must make sure that you have the required capacity in your SAN ahead of this need.  Also be aware that business needs change, and the capacity growth forecast by the business unit using your data server may well grow or shrink depending on business conditions.  This has an impact on your database capacity, so you need to keep an eye on changes on the rate of change of data-server capacity over time – you can’t just trust your initial estimates and hope that they’ll stick.

Another key consideration is storage for data that sits outside the boundaries of the DBMS itself.  This includes log files, historical configuration snapshots and performance monitoring/trace data, backups (an important one – don’t just plan for your full backups – also plan for your incremental and transaction log backups as well), and mirrored/log-shipped instances on “warm standby” hardware.  It is not difficult to find yourself needing up to 5-6 times the amount of storage for a high availability system than is consumed by the user databases by the time you get into all of the different storage requirements mentioned above.  Keep in mind that some data is more important than others, so you can run multiple storage tiers accordingly.  E.g. SQLDIAG output could easily be written to a non-redundant SATA drive instead of high performance SCSI/SAS disks.  On the flip-side, database backups should be written to tier 1 storage at some point to prevent data loss.  This might seem like an expensive choice, but if your backups aren’t available when you need them then you’re carrying a large risk.

Of course, capacity management also needs to take into account factors such as I/O bus bandwidth, CPU load, RAM required to maintain high cache hit ratios, network bandwidth and limitations arising from low level algorithms for managing CPU instruction queues.  These capacity requirements are rarely static.  They will almost always have peaks and troughs that offer constraints and opportunities for savvy system designers.

Data Platform Configuration Management

One thing I’ve learnt over the years is that installing a database server is not difficult – any mug can do it, and to a larger extent than you’d realize, any mug does!  But we’re talking about professional IT here, so there are expectations on database professionals that go well beyond just getting the data-server installed.  The data architect needs to mandate (or at the very least, review and approve) policies for how data platforms are to be configured within the enterprise.  This includes things like standardizing the collation to be installed on each server, transaction log and database backup policies, surface area configuration policies, disk layout standards, monitoring/pulse check configuration, standardized patch levels, consistent SQL Server install/database/log/backup directories, and all manner of other SQL Server configuration best practices.  The single most important design factor here is that the configuration of every server must be recoverable in case of emergency situations. 

Not only do these policies and procedures need to be created, but they also need to maintained, versioned, and applied as necessary to non-conformant systems.  SQL Server provides some great tools to assist with this – especially the Policy Based Management feature shipped with SQL Server 2008 – but there are other options out there for doing the same thing.  You can use System Centre to apply configuration policies to a named set of servers, and to monitor the SQL Server environment.  The SQLH2 tool can be used to perform standardized monitoring of SQL Server instances.  Additionally, you also have 3rd party options such as CA’s Unicenter and HP’s SA+OO (system automation + operations orchestration) toolsets to coordinate these kinds of tasks.  Standards are good, but tools that apply standards for you are even better!

Data Platform Support & Problem Management

If you’ve been through all the steps above, you probably have a database or two installed by now, and you now need to start thinking about who’s going to look after it.  This is where we start asking whether we need a dedicated Database Administrator (DBA) with skills in the platform we’re using (SQL Server in my case) or whether a DBA from a different platform (e.g. Oracle, DB2) can afford the time to get up to speed with a new DBMS – or possibly whether a Windows administrator can deliver the necessary level of support.  In small office environments you quite often have a single IT person supporting all of the infrastructure technologies in use, but in large office environments a more tightly specialized workforce is often hired. 

Having figured out who’s going to support your data platform, you now need to make sure they know what they’re doing.  Have they passed any SQL Server certifications?  Are they aware of the Sarbanes Oxley (SOX), ISO 900X and ITIL standards?  Do they have all the tools required to support the environment effectively?  Do you trust them to get the system back up and running when things go pear-shaped?  Or is your system sufficiently non-critical that you can afford to be off the air for a couple of days while a consultant is engaged to help with recovery?  Do you need to get training provided for the person you’ve hired?  Do they need some non-delivery time to focus on ongoing learning?  These are all important factors to consider when defining the human side of data platform support.

When things go wrong, do you have a process for how to address common issues?  Does your DBA know how to get a database back online in emergency mode to allow data recovery to occur?  Is a disaster recovery (DR) environment available to get users running while you fix the production environment?  Do you have procedures in place on how to failover to DR and back to production?  Have they been tested?  What are the organizational processes that come into play when the DBA identifies a problem with the system?  Are you following ITIL?  Do you have a problem management board in place to communicating problems to the business and giving the techos enough space to stay focused on the problem?

As you can see… this area of data infrastructure architecture requires careful thought and consideration, as it needs ensure that the costs of preventing or minimizing the impact of system outages do not outweigh the business impact of having the system offline.  The architect needs to have enough awareness of the cashflow impacts of outages to make the best choices on their clients’ behalf.

To Be Continued…

I’ve covered a lot here so far, but I need a break from typing and by now I’m sure any readers of this entry will also need a break from reading!  I’ll continue with another instalment in the next 24 hours or so.

Data Architecture Defined

Over the last few weeks I’ve been working with the PASS organization to start up a SQL Server Data Architecture Virtual Chapter (or “DArcVC” as it is starting to be known).  Data Architecture has basically been my job for the last 12 years.  I may have been paid as a DBA, or SQL developer, or BI consultant, or Microsoft Application Solutions consultant, but at the end of the day, what I’ve DONE in all of those roles has really revolved around data architecture.

For those of you not familiar with what data architecture entails, it’s probably worth starting with a definition of IT architecture.  IT architecture is the process of envisioning and designing information technology solutions that align to clients’ expected outcomes.  The architect who designs your house, hotel or office tower does pretty much the same thing.  The only difference is that they work with a different medium.  In a business context, the role of an IT architect is to envision and design systems that deliver tangible business outcomes.  In an entertainment context (e.g. computer games), the role of the architect is to envision and design systems and content that deliver entertainment outcomes. 

You get the picture… Architecture defines the end-state goals for the system builders to follow.  With that said, architects also have a hand in defining the processes, governance structures and organizations that deliver the system at hand.  The architect also provides a financial context for the work – from both cost and benefit perspectives.  Much like a bricks and mortar architect recommends certain building materials and construction methods to deliver measurable outcomes (e.g. energy efficiency, resistance to certain types of environmental conditions, etc) an IT architect makes recommendations on how a system must be structured to deliver value to the people that will use it.

The core concerns of IT architecture can be summarized using the acronym “BATOG”, defined as follows:

  • Business – Who is paying for the solution?  What is the total budget?  What is the expected ROI period?  What business efficiencies is the solution expected to deliver?
  • Applications – What software packages is required to deliver the required outcomes?  Do they need to be customized?  Or are they being built from scratch?  How do you make them work together?
  • Technology – What do you need to build and run your solution? 
  • Organization – Who will build and support the solution?  Who will make sure it gets delivered on time?  Who will manage the budget?
  • Governance – What are the processes that need to be enforced to ensure that the system meets the clients’ requirements?  How are these processes enforced?

Armed with that information, we can then apply those concerns to the Data component of IT architecture.  With that filter applied, data architecture can then be described as the process of envisioning and designing Data systems that deliver required client outcomes.  In a more task focused sense, this means:

  • Selection of data persistence systems (files, DBMS tools, etc)
  • Designing data models and creating a roadmap for their realization in end-state systems
  • Designing data acquisition, integration and presentation systems
  • Designing data lifecycle management systems (e.g. archiving/purging tools)
  • Ensuring that data is secure, but also accessible to the right people
  • Ensuring that data persistence systems are reliable – or, at the very least, have a recovery path from system-level failures.
  • Ensuring that data designs are closely aligned to business processes
  • Defining the organizational requirements that support the outcomes listed above (e.g. skills, training, roles)
  • Guiding the expectations and estimation processes of program/project managers who oversee data-oriented projects
  • Defining data platform configuration and sizing policies

There’s plenty more, but that gives you an idea of some of the high level tasks that data architects get involved with.  Data architects are typically not only concerned with data, but also in how raw data is turned into information and knowledge.  In this layered view:

  • Data is the “What”.  For example:
    • What is the customer’s name? 
    • What did they buy?
    • What did we charge them for it?
  • Information is quite literally the deliverable of informing a human about the what – especially when it leads to decisions and actions that deliver outcomes that provide value of some sort.  For example:
    • Why did that customer want to buy that product at that price?
    • Could we have charged the customer more for that product?
    • What are other stores charging for the same product?
  • Knowledge is the understanding that underlies the decision making enabled by information and data.  Knowledge is the secret sauce to how we capture, design and consume information.  For example:
    • The customer was in the 18-25 age bracket.  We know that people in that age bracket have a preference for the kind of product that customer purchased.
    • The decision to place that product in the window of stores in areas frequented by people in the 18-25 age bracket will lead to lots of sales.
    • If we make those people walk to the back of the store to actually pick that item off a shelf, then they have to walk past all sorts of other tempting items.  This may lead to impulse-driven cross-sell/up-sell.

So a data architect isn’t just concerned with the collection, storage and processing of data, but also with designing data to provide information outcomes, and using information to either extract new knowledge or to use knowledge to identify new opportunities for how we can derive value from information.

In the business intelligence community a commonly used description for why we build business intelligence systems is “Actionable Insight”.  Actionable insight is the goal of all data architectures, and is the key value proposition for capturing data in the first place.  Of course, if we do not design our data systems properly, we run the risk of being unable to derive the required business insights, which is no help to anyone.  If you can’t get actionable insights out of an information system, then there’s not much point spending money on it in the first place.

With that, I’m done.  I’ve given you a quick summary of IT architecture, and how data architecture fits into the broader context of its parent discipline.  I’ve given you an idea of the kinds of concerns that a data architect addresses on a day to day basis, and I’ve provided some perspectives on why data architecture is important.

In future blog entries, I’ll start drilling down into more detail on some of the specific practices and principles involved in data architecture.  I’ll use SQL Server as a means to explain and demonstrate those practices & principles, and to provide specific guidance on how SQL Server helps to address a variety of data architecture concerns.

August 12, 2010

Musings on Good Data Architecture

I was at a user group meeting tonight where a couple of developers were presenting on Entity Spaces.  It was interesting to see the tool in action, but I can’t help but feel a little cautious about the kind of strong typing that’s going on in the myriad of class files that are getting built by the code generation framework.  To me, it seems risky from a complexity perspective. 

In my opinion, a good ORM should be able to abstract the schema from the object-level implementation and provide a software-factory driven solution that can create the required Data Access Layer objects at run-time, and wrap the stored procedures, views and functions in the database with appropriate invocation interfaces.  Late-binding doesn’t come for free, but it does at least provide some resilience to real-time changes in the database schema.

Another issue I have with most ORMs is that they tend to either use an existing database schema as the “Master Model” for the objects, or use XML/UML object definitions as the “Master Model” for the database, and merrily force changes from one to the other without allowance for external dependencies.  In an Enterprise context, this is a major failure, as it is rare that any single app or database works completely in isolation from all others.  In general, I’d prefer to see something like what Microsoft are building with SQL Server Modelling Services – where the model is the first class design artefact for both object-orientated and entity-relational design.

So… how do we deal with these entrenched dependencies?  There are a few different approaches, each of which might be more appropriate on a case-by-case basis based on their cost of implementation, support and ongoing ownership.

Option 1 (outlined for me by Paul Stovell from Readify at the recent CodeCampSA event) is to build and maintain a messaging infrastructure which may or may not feed a centralized operational data store (ODS) then submit messages to the bus from each application in the enterprise.  Each application can then have its own specific tightly bound database or document store, and developers can be very productive in servicing the business’s tactical needs.  This is a very agile approach that cuts down significantly on up-front design.  However, it also requires some compensatory effort in building the data integration infrastructure required to transmit and statefully persist messages.  Additionally, the savings generated from the development process are likely to be lost the minute any compliance requirement manifests itself, and data quality becomes a huge issue when there’s no central control over what content needs to be sent down the integration bus.  Getting a clear “single view of truth” on data in-flight on the bus is near impossible with this approach, and business intelligence efforts will be severely hampered.   This solution can work well in small-shop environments where application boundaries are clearly defined and there is little-or-no overlap in the data created and/or consumed by each individual application.

Option 2 is to store all operational information in a centralized ODS (e.g. ERP data store – which may span several physical databases, even though it represents a single logical data repository) from which a centralized view of truth can be quickly and easily determined.  However, this approach also comes with its own hazards.  Application developers are locked into working with the schema provided by the ODS designers (ERP system developers or in-house IT data governance group), which can limit their productivity to some degree, as they need to encode more sophisticated concurrency management capabilities into their applications.  This requirement for higher maturity design patterns also means that developers need a higher level of skill than might otherwise be required in simpler, more segregated environments.  This additional skill level does not come for free – developers with enterprise-class development skills also come with enterprise-class price-tags, so the cost of building and maintaining applications can be significantly higher.  Stacked dependencies on the core data store schema also means that many different applications may need to be changed to accommodate any discrete database change.  This results in highly bureaucratic change and configuration disciplines, which result in yet more costs.  On the upside, you generally get pretty good data quality out of this kind of system, so the cost of business intelligence is significantly reduced.

Option 3 is to go with an unplanned hybrid model of options 1 and 2.  You have a mix of core applications that hit the ODS directly and “out-rigger” tactical applications that use tactical data integration solutions to share data with the central data store.  In this kind of environment (probably the most common one seen in enterprises of any size), it is not uncommon to find data siloed into a variety of database management systems (e.g. DB2, SQL Server, Oracle, MySQL, even Access, FileMakerPro, etc), and integrated using whatever point solutions were easiest or mandated at the time they were built.  This is the classic “Hairball” scenario in which data quality is often compromised by a lack of enterprise architecture discipline, and untangling the dependencies that spring up in these environments can be catastrophically expensive.  It’s practically impossible to deliver enterprise-grade business intelligence solutions in these kinds of environments.  At best, ICT groups can deliver tactical BI solutions which could possibly be combined in a useful manner.  The ongoing cost of implementing business change in an environment such as this is irredeemably high.  Quite often, the fines from compliance failures can be smaller than the cost of preventing them.  Option 3 is not a happy place to be.  If this is your world, you have my sympathy.

Option 4 is a better place to be.  Option 4 is a blend of options 1 and 2, but adds a data governance layer across all application design and data integration activities.   Non-core tactical applications can be developed in isolation and connected to the ODS via messaging solutions provided that the messages conform to a centralized corporate standard.  These messages are not limited to distribution to the ODS however – they can also be passed into the BI system (Data Warehouse or Data Marts as appropriate) and to other applications.  Because all “outrigger” applications support a common set of data interfaces, they are highly sociable; and because the message bus has a consistent translation layer that allows messages to be merged into the ODS, the marshalling of data for a real-time “Common view of Truth” is not impossible.  This approach does come with some core caveats however. 

  1. "Tactical” applications need to honour corporate standards for message format and data completeness.  If they do not, you’re back in the chaotic world of option 3.
  2. Business rules are either applied within core applications, or within orchestrations on the message bus – but not both.  This eliminates duplication and inappropriate overloading of business rules.
  3. BI solutions track not only the content of the ODS, but also in-flight messages to ensure up-to-the-minute business data integration.
  4. A centrally managed enterprise data model is in place, which reflects not only the entities and attributes for data in the system, functional requirements (transformations, de-identifications, business rules, (de-)serialization, security and aggregations), but also the data life-cycle states (initial acquisition, transfers between systems, merging with the ODS/DW environment, de-identification steps, archiving, purging and also wire-transmission states where there may be encryption and/or other privacy requirements).

While these requirements may seem onerous, they do actually save money if applied rigorously and without excessive bureaucracy.  Centralized data governance is not always something you want a team of auditors to be put in charge of.  The data architects responsible for data governance need to have the trust of the business, and this is something that is earned, not simply granted. 

However, this is what architects do.  We sit in the grey zone between the business and technology stacks.  We help business people identify what technology solutions are worth investing in, and we keep technologists focused on the delivery of positive business outcomes.  We need the trust of both parties, and in the last couple of decades, there have been plenty of situations where we have let both sides down.  Architects need to climb out of the ivory tower and engage directly with technologists for the full life of every system.  Architects also need to be cognizant of business value and ensure that our designs and recommendations yield every cent of ROI promised.

And to me, the most critical architect of the lot is the data architect.  Failing to model and govern data and system-wide information behaviour effectively sets everyone else up for failure.  If a data architect can’t get the ears of individual application architects then they set their peers’ projects up for diminished value, if not outright failure.  If a data architect can’t get the buy-in from a solution or sales architect to ensure that the appropriate governance processes are properly scoped and delivered, then they can likewise take some ownership of the failure.  Data architects need to ensure that infrastructure architects understand the workloads and network traffic requirements that will need to be delivered in order to deliver a healthy solution.  And data architects need to have the ear of enterprise architects to ensure that the importance of doing data properly isn’t lost in translation between the EA and their executive clients.  Don’t get me wrong… the other folks still need to get their pieces right, but if a data architect gets it wrong, everyone else suffers.

This is one of the reasons I’ve been lobbying PASS to create a Data Architecture virtual chapter for SQL Server.  I’m starting to get some traction, so watch this space for more updates.

July 08, 2010

Perfect Information vs. Chaos

Buck Woody recently posted an interesting blog item on the perils of expecting "perfect information" for economic modelling. My response follows.

Information can theoretically be perfect, but it's existence relies on the assumption of a completely deterministic (rational) universe. I think we'd all agree that human behaviour is often so complex as to describe it as irrational, so all economic models necessarily have some sort of fuzz factor when you apply them to the real world. Asimov spoke a lot about these topics in the Foundation novels and I think he nailed it on the head.

This mania that rich western populations have around "command and control", "risk management", and the commoditization of labour through specialization (c.f. W.E. Deming, et al) is all based on the assumption that we can measure and manage every aspect of our physical, social and economic environments. However, when tested, this assumption fails as the respective systems ARE chaotic, DO NOT always behave deterministically, and the expenditure of energy and resources to bring order to the chaos is always orders of magnitude higher than the amount of resources and energy in the system being "managed".

The kinds of modelling under discussion would be much more successful if instead of trying to get more precise answers to silly questions, we started trying to understand more about the chaotic order in our systems, and start factoring entropic terms into our understanding of our environment. Chaos theory dictates that these systems will usually behave more or less deterministically, but will only do so predictably up to a point. We have the tools to start examining the relationships in our macro-ordered black box systems, but they are not the tools of nice, safe but ultimately inadequate tenth grade algebra. They are instead the tools of advanced statisticians, and we need to understand that the results are expressed in terms of probabilities, not certainties.  We need to build chaos and entropy into our models in order to understand the range of outputs we're likely to see given a fixed set of inputs.

Given the discussion above, it's interesting to note that some of the data mining algorithms in MS SQL Server allow us to start building these improved models.  We can look at clusters of results, perform sensitivity analyses on the input variables in our systems, and perform time-series analyses to predict future outcomes.  We can build Bayesian decision trees to analyze paths of least resistance to the outcomes we want.  And all of this can happen without having to build equations in a simple algebraic form to which some kind of mythical certainty can be ascribed.  The clever people at MS Research have already done most of the heavy lifting in allowing SQL Server to do this for us.

January 29, 2010

Three Things That Got Me Here

I have been tagged by Josef Richberg (@SQLRunner on Twitter) to blog on a meme covered by several SQL Server and .NET professionals since the concept got started around the beginning of January.  Apparently being tagged means I *must* continue the meme, so… in three movements… here we go.


As a kid, my parents bought me a few bits and pieces of Lego.  The first 10 minutes of any session spent with a new Lego kit would be spent following the assembly instructions to create the train, car, castle, etc; then the next 2 hours reassembling the pieces in as many ways as possible – along with the other Lego pieces in my collection.  I created aircraft, houses, road trains, castles, bunkers, spacecraft… whatever my imagination and the supply of pieces could conjure up. 

As I got older, my folks started to realize that I’d get a lot out of the Technic sets, and my Lego projects started to take on a far more mechanical focus.  I started assembling more complex mechanisms and towards the point where I put my Lego days behind me, I was building piston-driven engines, rack-and-pinion steering assemblies, cranes, gearing systems and so on.  Lego gave me a great leg up on abstract spatial thinking, and also helped me to acquire systemic design skills. 

Without these, my interest in data, design, architecture, IT and engineering in general would never have got to the point they are today, and I’d probably have ended up as either a musician or an economist.  Instead, I studied Mechanical Engineering at University, which I dropped out of in order to make a move into the IT industry.

The Piano

From the age of around 5, my parents kept a piano in the house.  By the time I was 6, I’d taught myself how to play simple versions of a number of nursery rhymes and Christmas carols.  This piano was replaced by the one they still own today when I was around 8 or 9, and in fifth grade I started piano lessons with a septegenarian nun named Sister Tarsissius. Sister T got me working on scales, triads and basic two handed piano playing.  I got a start with music notation, and ended up being bored out of my skull.  She retired a couple years later, and I started taking lessons from Mrs McMillian, the wife of one of the teachers at the school where dad worked as a Deputy. 

Mrs McMillian adapted quickly to the fact that Mozart studies were going to turn me off playing piano permanently, and instead started teaching me some basic jazz and blues.  I took to this like a duck to water, and 30 years after starting to play piano I still jam with bands, compose tunes, tinker with arrangements and so on.

More importantly, music helped me further develop abstract spatial thinking, pattern matching, and in the context of group performance, learning valuable skills that translated to enhanced social skills.

The System-80

At the age of about 12, a couple of my friends bought computers.  Once of them had a Vic-20, and the other a Commodore-64, and both systems supported an early version of the BASIC language.  After a while, I started to pester my parents to get a computer.  After about a year, they finally caved in and bought one of the first ever personal computer clones - a TRS-80 clone from Dick Smith Electronics called the System-80.  It sported 16KB of RAM, a built in data-set (tape drive capable of reading normal audio tapes) and a green-screen monitor capable of 3x8 pixel blog graphics, and ran a BASIC interpreter.

Within weeks, dad had subscribed to an X-80 computer magazine that packaged programs on tapes every issue or two.  My brother and I both played games on the machine, but I also took up an interest in writing my own programs.  A year later, I was writing PEEK and POKE commands to read keyboard input and write graphics to screen, writing simple shooter games, and writing my own choose-your-own-adventure games.

This experience drove my interest in software development, which persists to this day and has informed the choice I made to make a career for myself in IT.

The meme for the “Three Things” blog post demands that I now tag three other bloggers.  I’m feeling vaguely contrary right now, so I’m going to simply tag any reader who hasn’t already responded to this theme.  Next post?  The third and final post on my series on table metadata.


January 27, 2010

Mastering the Metadata Part 1 (Tables #2)

n the last Metadata-themed post to my SQL blog, I discussed some of the metadata used to define tables – with particular emphasis on their storage.  I didn’t go deep into what the metadata available in the storage-specific tables describes, but I did give a brief description of their purpose and the kind of information they contain.  Today, I want to focus in more on the logical structure of tables, with a view towards being able to reverse engineer DDL statements to create a copy of the table.

I’ve already covered sys.tables and sys.objects, so I won’t rehash that ground.  Each table object logically contains (although strictly speaking, is made up of) one or more columns.  Defining columns starts with sys.columns.

Object_id refers to the parent object, which for the purposes of this discussion will be a table.  If you only want to get the object name for a single table, you can dereference the object_id to an object name using the Object_Name(<object_id>) system function as a means to avoid having to do a join against sys.tables or sys.objects.  The name column specifies the name of the column.  The column_id field gives you the ordinal for a given column in the table definition.  e.g. if we were to build the following table we would see 1 as the column_id for the field MyTable_ID and 2 as the column_id for MyTable_StringContent.

If we were to remove the WHERE clause from this query, we could return these results for all objects in the database that have columns defined in the sys.columns view.

System_type_id and user_type_id refer to data types that can be retrieved from the sys.types catalog view.  Max_Length, Precision and Scale are used to specify metadata specifying the length of the data type in each column.  Max_Length applies to character data and binary objects.  Precision and Scale apply to Numeric (AKA Decimal) data types, where precision defines the total number of digits that can be stored, and Decimal defines the number of digits to the right of the decimal point.

A number of bit-flags showing the boolean state of various column options follows the length metadata.  Most of these can be ignored in the column definitions if they are not actively set, with the exception of the Is_Null flag.  T-SQL table definitions require that any NOT NULL constraints are specified inline in their column specifications.  NULLable (1) is the default behaviour here, so we need to make sure we capture 0s rather than 1s in this column for scripting and configuration capture purposes.

Finally, the integer fields at the far right of the sys.columns view are used to specify external object dependencies (e.g. XML schema collections, Rule objects and parent objects), while the final two bit columns specify the boolean column-state with respect to SQL Server 2008’s sparse column feature.

I made mention of another important catalog view above – sys.types – which describes the physical data types of a given column.  The system_type_id and user_type_id columns refer to fields in this view, and can be used as lookup keys to determine the data-type primitives and any user-type dependencies we need to consider before scripting tables.

Think for a second about the following example.

I’ve created a SQL Server user type based on one of the base server types, then assigned that user type to the MyTable_String_Content column.  Using the inner joins listed in the query which link the user_type_id from sys.columns to the name in sys.types, and the system_type_id in sys.columns to the user_type_id of the primitive for that type, you can see that myTable contains a myStringContent typed column as expected, and that it is based on an nvarchar base system type.  We could get the length of the user type by querying for Max_Length in the sys.types table.

We can also query the is_nullable flag to see if the type is nullable, is_user_defined to see if we need to perform the secondary join to get a primitive base type, and to see if the type includes a T-SQL Rule object binding.  Note: Many of the system types are specific variations on a more basic type.  As such, it can be worthwhile adding the second join anyway.

Some of you might be wondering now, how is this useful?  Well… let’s say I want to build a database configuration snapshot tool.  I can use queries such as the ones above to build an inventory of the column definitions for each table, and in conjunction with system catalog views discussed last week, I can also build a base definition for the table itself.  Anyone who’s found that the backups for their SQL Server instance weren’t in great shape immediately understands the value of being able to run a set of T-SQL table creation scripts to generate copies of your tables on a nice fresh SQL Server install.

Other practical applications for this kind of metadata are as follows:

  • Querying user databases to list base tables and primitive data types when pulling together information about source systems for a data warehouse, data quality management system or enterprise system bus solution.
  • Generating OO data layer components from table definitions using an ORM tool such as LLBGen, MyGeneration, NHibernate or similar products.

In the case of a data warehouse solution, the table definitions from source database schemas can be “decorated” with additional metadata that supports the management extract, transform and load (ETL) operations, system auditing, idempotent load operations (load operations can be reprocessed multiple times without having to clean up previous load attempts) and slowly changing diimensions (SCDs).

The next post in this series will be the final one on table metadata – I’ll be looking at constraint and index definitions, and highlighting the catalog views that allow us to capture and/or script the behavioural elements of table design, as opposed to the table’s structure and physical storage properties.  In the meantime, thanks for your continued interest, and Code Well, Code Often!

January 15, 2010

Goals, Themes, Memes and Mayhem

Okay… still not the promised blog entry on doing nifty stuff with table metadata.  Another response to someone else’s blogging.  I just can’t help myself… call me a sucker for the #SQLPASS community, now that I know it exists.  Short, sharp and shiny will this be.

There’s a meme for 2010 goal-stating blogs doing the rounds on SQL Server MVP blogs at the moment.  Luminaries such as Brent Ozar (@BrentO), Jeremiah Peschka (@peschkaj), Andy Leonard (@AndyLeonard), Kevin Kline (@kekline),  and a host of others have all tagged other bloggers or been tagged by other bloggers in a copy-cat craze that has swept the gl0bal interwebz.

So… with the Meme out the way, onto the goals.

I am not a Microsoft MVP.  Yet.  My first goal for the year is to take a good number of steps in that direction by ticking off a few of the reasonably well understood criteria for earning the award.  I’m currently writing two blogs about Microsoft products from two very different perspectives – this one as an “expert” (aka “A drip under pressure”, as we like to call them here in Australia); and the Ozziemedes blog where I disavow all knowledge of what I *should* be doing, but am sharing my learning experiences with others.

My second goal for the year is do some contracting and get enough cash together to successfully complete both  SQL Server MCITP and .NET MCAD boot-camps by the end of Q3.  I’d also like to see if I can earn myself an MCT cert in SQL Server.

The third goal of the year is to get a small business called “Poddify” off the ground.  The basic concept of Poddify is simple: Podcast production and Publishing for small businesses.  I have skills and experience in the audio engineering and content production areas which make this a good fit for me.  I’d like to be supporting myself financially with this business and MS Training by the end of 2010, with a bit of help from 6 months worth of contracting.

Goal number four is to have a detailed outline for a book on Enterprise-grade Data Architecture using SQL Server completed by the end of the year.  I’d like to get the outline in front of a publisher in early 2011 and have the book published by 2012.  I have thoughts and knowledge in this area, and I’m keen to make sure that they are not lost should I get hit by a bus.

The final goal of the year is to start working on some software projects that I’ve had rattling around the back of my cranium.  These projects will lead to other people’s lives becoming simpler, and hopefully some additional pull-through income for me.

So now we come to the Theme part of the Meme.  Most of the bloggers above have nominated a single word theme for the year.  Mine will be “Liberty.”  I’ve spent most of the last 14 years in this industry making other people rich, busting my arse 48 weeks a year for little appreciation and quite frankly, a lot less money than I’m worth.  That changes this year.  I’m taking back control of my professional life and will be investing my time not in “being more useful to my employer,” but in “taking better care of myself.”

So… shout out time.  I want to tag Rob Farley (@Rob_Farley) and Greg Low (@Greg_Low) for the great work they do supporting the SQL Server community in Australia.  Rob is an MVP and consultant who runs the SQL Server user group in Adelaide that I started before life as a salariman started taking over my time, and Greg is one of 3 Regional Directors in Australia.  His track record of presenting and community work speaks for itself.  I know there are plenty of other people I *haven’t* given shout outs to, but these guys are role models to me – they’ve earnt them by being relentless in support of the SQL Server community for several years.

So… here’s to 2010.  May it bring success to you all, Liberty to me… oh… and World Peace!


A Brief Diversion on Database Design

My promised follow-up on the sys.tables post is still in the skunk works.  I got distracted by an interesting post by Buck Woody of Microsoft over on his Blog on  The questions he posed are as follows:

1. What process do you follow?

2. How important are the business requirements?

3. What tool do you use to create the design, do you need it to diagram, do you even care about diagrams?

4. What’s your biggest pain-point about designing? Not with SQL Server, mind you, just the designing part.

If you’re a DBA, Data Architect or Data Modeller, feel free to head over to Buck’s blog and post your own response!

Given that my current blogging focus here is on metadata and data modelling for SQL Server, I thought I’d share my response.  Without further ado, here it is! 

1) I'm pretty much a traditionalist when it comes to DB Modelling.  I'm sure these approaches will be pretty familiar to you.

My DB Design work routinely starts with a big (A3 or bigger) piece of paper or a whiteboard and a domain map - for example, in a traditional order processing system I'd have a big box for Sales, another box for Products, another for Order Fulfilment, one for Employees, one for Suppliers, etc.

I then put boxes inside (or spanning) those regions, starting at the intersects with bridging entities e.g. Orders bridges Order Fulfilment and Sales, Products connects to Orders, Suppliers connects to Products, Employee Sales, etc.  I'll then work back towards the centre of each domain box with more specific entities (e.g. Order Details, Product Options, Employee Type, Sales Channel Type, etc.

Once I've sketched out the top level entities (Len Silverston's reference models are really handy for this) I then sit down with the clients/end-users and attack the model.  What bits have I missed?  What bits don't you need?  What are the business processes that are supported by each of these subject areas?  That usually leads to more entities, types, sub-types, etc.. and tends to surface parent-child hierarchies.

With the extra detail about the entities, I'll then start adding some meat to the bones.  What do you need to know about an order?  What information about an order do you need to fulfill it correctly?  What are the decision points around order processing?  Likewise for the other core entities identified.

By the time I'm done with this, my model will typically be in ERwin or Visio.  I usually have a pretty good handle on cardinality, required business information fields (e.g. Names, phone numbers, order quantities, etc), business metadata fields required to support business automation (e.g. order IDs, invoice numbers, payment IDs).  At this point I'll do a normalization and refactoring sweep to do things like convert subtypes to type detail tables with a type-parent intersect table.

Next step is adding unique keys.  I'll probably already have an Int/Decimal(N, 0) of some description doing duty as a surrogate key which also has a physical PK constraint defined on it.  I'll define unique constraints against natural keys.  I'll also start thinking about data domains (check constraints, defaults, rules, etc) and start adding additional "state" metadata - e.g. IsCourierDeliveryOnly on Product, Is401KEligible on Employee, etc.  I'll also have data-types on pretty much all of the fields in the model.

At this point, I'll put the model back in front of the stakeholders and do a walkthrough.  I'll correct clarified requirements, add new columns/constraints, refactor where necessary and make sure that all auditing requirements are clearly captured.  This usually gives me enough to freeze the model at a point where I can hand it to a dev team to start coding against it.

This is where I'll do a cut to SQL Files, import it into VSTS Dev/DB edition and start working on code-gen for standard CRUD procs, triggers and functions.  I'll keep tinkering with the model - adding metadata to support the generation of CRUD procs/functions/triggers (e.g. extended properties, etc), and move from managing the model as a single artefact to managing individual physical objects, synching the changes through source control on a per-item basis.

2) Business requirements are critical, but you can't do it all at once.  In my experience, end-users and stakeholders typically don't understand enough about their processes to come up with all of the required information at once.  I'm convinced you need at least 2, and up to as many as 5 detailed passes (if it's a big app) on the model to get to a point where the model stands up to whatever the devs and users throw at it.

3) Not only will I put the model into a diagram, but I'll try to make the diagram as readable as possible by eliminating crossing relationship lines where possible.  Readability is really important, because when a business user comes to me and says "Did we capture XYZ" I can find the entities concerned and trace their relationships quickly without having to navigate a huge snarl of overlapping relationship lines.  As discussed, ERwin and Visio are tools of choice - but if you have ANY pull with the Visio people, can you please get them to unseal the DB modelling classes so we can do automation against them in the same way as we can do in ERwin.

4. Biggest pain points? 

A) The lack of policy based design in most modelling tools - e.g. It would be fantastic to have Visio not only telling me that I have a subtype relationship that needs refactoring, but automatically refactoring it for me when I move to physical model mode using a configurable refactoring policy for which I can control the templates. 

B) The lack of XMI formatted export capabilities in most modelling tools.  From memory, it was possible to export SQL Server 2000 Metadata Services models to XMI, but since that product has been deprecated, we're stuck with forward engineering to SQL and having to export XSD's via "SELECT ... FOR XML SCHEMA". 

If the "M" language in Oslo fixes these issues, then I'll be thrilled.  As a data architect, I'm excited by what SQL Server Metadata Services offers, not only for SQL Server data design, but for pushing data designs deeper into the solution stack - e.g. generating biztalk schemas and XSL transformation defintions for common transactions based on updateable views, generating WSDL method signatures for those XSDs, generating data-grid controls from those views, generating data access layers from SQL (or model) metadata, etc.

Stay tuned for more on table metadata in the next day or so.


January 13, 2010

Mastering The Metadata: Part 1 (Tables #1)

In this post, I’m going to outline the core metadata SQL Server 2008 uses for Tables, and how you might be able to use this metadata in a useful manner.  In this post, I’ll only go as far as outlining the potential for using table metadata to discover storage characteristics of the table.  I’ll provide more detailed examples in my next post of how you can put this to work.  Before we begin, a brief history lesson to do some context setting.

Once upon a time in the distant past, a relational database product called SQL Server was birthed unto the world.  Twice.  Once by a company called Sybase, and later, by an upstart operating system developer called Microsoft.  Microsoft helped Sybase and Ashton-Tate (developers of the dBase DBMS) to port SQL Server to their LAN-Manager 2.0 operating system, which later evolved into Windows NT 3.0.  When Ashton-Tate dropped out of the mix, Microsoft obtained a license from Sybase to develop the product on the MS Windows platform.  In the meantime, Sybase continued to develop their product for various *NIX flavours and other PC-based platforms.
As the products evolved, both Microsoft and Sybase both described new functionality through additions to the system catalog.  Microsoft went through an major leap at the SQL Server 7.0 release, which moved away from the old Sybase code-base to one that was more open, extensible and standards-compliant. Sybase and Microsoft took different approaches to managing catalog changes.  Microsoft maintained a copy of the old catalog, first as tables, to which additional fields could be appended using a shared PK; then moving on to creating new tables to replace the old ones, but adding views with the name of the old catalog tables.  Between SQL Server 7.0 and 2008 there’s been a number of changes to the approach.  In SQL Server 2008, all system metadata is now stored in the resource database (including user DB schemas).  Schema information for each database is referenced via views in the “sys” schema.  A version of the old “sys<blah>” SQL Server 6.x schema objects remains for backward compatibility purposes, but I’ll be describing metadata in the newer format. 
Sorry for the diversion, but sometimes a little background makes the present situation much easier to understand.  Onwards to Table Metadata!

Here we are, back at the original topic of table metadata.  Let’s start with sys.objects.


Sys.Objects contains a row for each object in the database.  In data modelling terms, it serves as a master type for a number of more specific types, which we know as tables, views, stored procedures, constraints, keys, functions, triggers, etc.  The single most important thing to know is that sys.objects lets us get an object_id for our object using a join (instead of a function) which can yield better performance if we’re loading information about a LOT of objects at a time.  If there’s any overloads on object names (e.g. objects with the same in different schemas/owners within a database, or views over a table with the same name) we can also extract keys for the schema, owner and object type from this view.  Ultimately, what we’re after in most cases is a single unique set of identifiers for the objects we want information on.

The rest of the metadata in this view provides details on the object’s lineage (create/modified date, whether it’s been created by Microsoft or a third party) and whether the object  The is_published and is_schema_published columns are replication metadata.  If is_published = 1 (true) then the object is published for replication.  If is_schema_published = 1 (true) then the object definition itself is also published via replication.


In data modelling terms, the sys.tables view is effectively one of several subtypes of the sys.objects view.  As such, it contains information that is common with its logical parent entity.

You can see in the SSMS snip above that name, object_id, and so on are all identical to the records in sys.objects down as far as is_schema_published.  From there on, all of the columns are specific to table objects.  These columns are described in detail in the MSDN Library for SQL Server 2008.
A number of SQL Server 2008 specific columns are worth pointing out.
The following columns in sys.tables are useful for identifying the placement of table data of various types (on either a file-group or partition schema).
  • lob_data_space_id
  • file_stream_data_space_id
These fields can both be joined against the data_space_id column in the following catalog views:
  • sys.filegroups (if you want to find the File Group name)
  • sys.partition_schemes (if you want to find the name of the partitioning scheme to bind the table to)
  • sys.indexes (if you want to identify the file-groups that indexes attached to this table are defined on)
  • sys.allocation_units (useful when you want to find the number of pages allocated to the table, its indexes and/or out-of-row LOB storage)
  • sys.fulltext_catalogs (useful for finding paths to system catalogs – be warned though – this view is due to be changed in a future release, so don’t hard-code references to data_space_id in this view into applications)
  • sys.data_spaces (useful for getting a name back for a data space if you don’t need to distinguish between file-groups and partition schemes)
  • sys.destination_data_spaces (useful if you want to get information about the number of potential destinations associated with a partition scheme for a given table)
  • sys.master_files (useful for enumerating all of the files that a given table will be written to and/or the types of those files – note this view returns file records for ALL databases on the server)
  • sys.database_files (functionally identical to sys.master_files except only files within the current database context can be shown)
These columns can also be joined against the backupfilegroup table in the MSDB database, using filegroup_id as the join key (useful to identify whether a given table was backed up during a given backup). 
Now… have a think about where all this could be useful.  If you were writing an script/application whose purpose was to enumerate a specified set of tables, and identify
To be continued… 

What’s Next?

Sadly, I’ve run out of time, and this post is already quite large.  In the next post, I’ll dive into some examples of how you can put some of the table storage lookups listed above to work, and why they might be important.

Useful references

Microsoft have included a good article on how to use some of the system metadata in the schema catalog views in this article.  The following links lead to the MSDN library pages for the catalog views and tables mentioned in this post.

There was an error in this gadget