August 12, 2010

Musings on Good Data Architecture

I was at a user group meeting tonight where a couple of developers were presenting on EntitySpaces. It was interesting to see the tool in action, but I can’t help feeling a little cautious about the kind of strong typing that’s spread across the myriad class files built by the code generation framework. To me, it seems risky from a complexity perspective.

In my opinion, a good ORM should be able to abstract the schema from the object-level implementation and provide a software-factory-driven solution that can create the required Data Access Layer objects at run-time, and wrap the stored procedures, views and functions in the database with appropriate invocation interfaces. Late binding doesn’t come for free, but it does at least provide some resilience to real-time changes in the database schema.
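To make the run-time factory idea a bit more concrete, here is a minimal sketch in Python against SQLite. It is an illustration under my own assumptions, not how any particular ORM works: the `make_dal_class` function and the `customer` table are hypothetical names, and the table name is treated as trusted input.

```python
import sqlite3


def make_dal_class(conn: sqlite3.Connection, table: str) -> type:
    """Build a data-access class for `table` from the schema as it is right now."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = tuple(row[1] for row in conn.execute(f'PRAGMA table_info("{table}")'))
    if not columns:
        raise LookupError(f"no such table: {table}")

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.conn = connection

    def insert(self, **values) -> None:
        # Reject columns that did not exist when this class was built.
        unknown = set(values) - set(columns)
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        names = ", ".join(f'"{name}"' for name in values)
        marks = ", ".join("?" for _ in values)
        self.conn.execute(
            f'INSERT INTO "{table}" ({names}) VALUES ({marks})', tuple(values.values())
        )

    def fetch_all(self) -> list:
        cursor = self.conn.execute(f'SELECT * FROM "{table}"')
        # Column names come from the live result set, not from generated code.
        names = [d[0] for d in cursor.description]
        return [dict(zip(names, row)) for row in cursor]

    members = {"__init__": __init__, "insert": insert, "fetch_all": fetch_all, "columns": columns}
    return type(f"{table.title()}Dal", (), members)


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    CustomerDal = make_dal_class(db, "customer")
    CustomerDal(db).insert(name="Ada")

    # The schema changes underneath us; rebuilding the class picks up the new column.
    db.execute("ALTER TABLE customer ADD COLUMN email TEXT")
    CustomerDal = make_dal_class(db, "customer")
    CustomerDal(db).insert(name="Grace", email="grace@example.com")
    print(CustomerDal(db).fetch_all())
```

The trade-off is the one described above: nothing is checked at compile time, but a schema change only requires calling the factory again rather than regenerating and redeploying a set of class files.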

Another issue I have with most ORMs is that they tend to either use an existing database schema as the “Master Model” for the objects, or use XML/UML object definitions as the “Master Model” for the database, and merrily force changes from one to the other without allowance for external dependencies. In an Enterprise context, this is a major failure, as it is rare that any single app or database works completely in isolation from all others. In general, I’d prefer to see something like what Microsoft are building with SQL Server Modeling Services, where the model is the first-class design artefact for both object-oriented and entity-relational design.

So… how do we deal with these entrenched dependencies?  There are a few different approaches, each of which might be more appropriate on a case-by-case basis based on their cost of implementation, support and ongoing ownership.

Option 1 (outlined for me by Paul Stovell from Readify at the recent CodeCampSA event) is to build and maintain a messaging infrastructure, which may or may not feed a centralized operational data store (ODS), and then submit messages to the bus from each application in the enterprise. Each application can then have its own tightly bound database or document store, and developers can be very productive in servicing the business’s tactical needs. This is a very agile approach that cuts down significantly on up-front design. However, it also requires some compensatory effort in building the data integration infrastructure required to transmit and statefully persist messages. Additionally, the savings generated from the development process are likely to be lost the minute any compliance requirement manifests itself, and data quality becomes a huge issue when there’s no central control over what content needs to be sent down the integration bus. Getting a clear “single view of truth” on data in flight on the bus is nearly impossible with this approach, and business intelligence efforts will be severely hampered. This solution can work well in small-shop environments where application boundaries are clearly defined and there is little or no overlap in the data created and/or consumed by each individual application.
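A toy version of Option 1 shows where the data quality problem comes from. This is only an in-memory sketch: `MessageBus`, `OperationalDataStore` and `field_shapes` are hypothetical names for illustration and do not correspond to any real messaging product.

```python
from collections import defaultdict


class MessageBus:
    """In-memory stand-in for an enterprise message bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)  # topic -> list of subscriber callables

    def subscribe(self, topic: str, handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # No validation here: the bus forwards whatever each application sends.
        for handler in self._handlers[topic]:
            handler(message)


class OperationalDataStore:
    """Optional central subscriber that persists every message it receives."""

    def __init__(self) -> None:
        self.records = []

    def store(self, message: dict) -> None:
        self.records.append(message)


def field_shapes(records: list) -> set:
    """Return the distinct sets of field names seen across the stored records."""
    return {frozenset(record) for record in records}


if __name__ == "__main__":
    bus = MessageBus()
    ods = OperationalDataStore()
    bus.subscribe("customer", ods.store)

    # Two applications describe the same customer in their own local vocabulary.
    bus.publish("customer", {"customer_id": 7, "name": "Ada"})
    bus.publish("customer", {"custNo": "7", "fullName": "Ada L."})

    # More than one shape for one entity is the "no single view of truth" problem.
    print(len(field_shapes(ods.records)))
```

Each application stays productive because nothing stops it from publishing whatever suits it, and that same freedom is what leaves the ODS holding several incompatible descriptions of one entity.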

Option 2 is to store all operational information in a centralized ODS (e.g. an ERP data store, which may span several physical databases even though it represents a single logical data repository) from which a centralized view of truth can be quickly and easily determined. However, this approach also comes with its own hazards. Application developers are locked into working with the schema provided by the ODS designers (ERP system developers or an in-house IT data governance group), which can limit their productivity to some degree, as they need to encode more sophisticated concurrency management capabilities into their applications. This requirement for higher-maturity design patterns also means that developers need a higher level of skill than might otherwise be required in simpler, more segregated environments. This additional skill level does not come for free: developers with enterprise-class development skills also come with enterprise-class price tags, so the cost of building and maintaining applications can be significantly higher. Stacked dependencies on the core data store schema also mean that many different applications may need to be changed to accommodate any discrete database change. This results in highly bureaucratic change and configuration disciplines, which result in yet more costs. On the upside, you generally get pretty good data quality out of this kind of system, so the cost of business intelligence is significantly reduced.
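As one example of the kind of concurrency management an application sharing a central store might need, here is a sketch of optimistic concurrency using a version column. This is my choice of technique for illustration; the `account` table, its columns and the function names are all hypothetical.

```python
import sqlite3


class StaleRowError(Exception):
    """Raised when another writer changed the row after we read it."""


def read_account(conn: sqlite3.Connection, account_id: int) -> tuple:
    """Return (balance, version) for the account."""
    row = conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no account with id {account_id}")
    return row


def update_balance(conn: sqlite3.Connection, account_id: int,
                   new_balance: int, expected_version: int) -> None:
    """Write only if the row still carries the version we originally read."""
    cursor = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    if cursor.rowcount != 1:
        # Nothing matched: someone else updated (or deleted) the row first.
        raise StaleRowError(f"account {account_id} changed since it was read")


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
    db.execute("INSERT INTO account VALUES (1, 100, 1)")

    # Two applications read the same row, then both try to write.
    _, version_a = read_account(db, 1)
    _, version_b = read_account(db, 1)
    update_balance(db, 1, 150, version_a)      # first writer succeeds
    try:
        update_balance(db, 1, 90, version_b)   # second writer holds a stale version
    except StaleRowError as exc:
        print("rejected:", exc)
```

Every application that writes to the shared store has to follow the same discipline, which is part of why the skill bar and the change-control overhead both go up under this option.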

Option 3 is to go with an unplanned hybrid model of options 1 and 2. You have a mix of core applications that hit the ODS directly and “out-rigger” tactical applications that use tactical data integration solutions to share data with the central data store. In this kind of environment (probably the most common one seen in enterprises of any size), it is not uncommon to find data siloed into a variety of database management systems (e.g. DB2, SQL Server, Oracle, MySQL, even Access, FileMaker Pro, etc.), and integrated using whatever point solutions were easiest or mandated at the time they were built. This is the classic “Hairball” scenario in which data quality is often compromised by a lack of enterprise architecture discipline, and untangling the dependencies that spring up in these environments can be catastrophically expensive. It’s practically impossible to deliver enterprise-grade business intelligence solutions in these kinds of environments. At best, ICT groups can deliver tactical BI solutions which could possibly be combined in a useful manner. The ongoing cost of implementing business change in an environment such as this is irredeemably high. Quite often, the fines from compliance failures can be smaller than the cost of preventing them. Option 3 is not a happy place to be. If this is your world, you have my sympathy.

Option 4 is a better place to be. Option 4 is a blend of options 1 and 2, but adds a data governance layer across all application design and data integration activities. Non-core tactical applications can be developed in isolation and connected to the ODS via messaging solutions, provided that the messages conform to a centralized corporate standard. These messages are not limited to distribution to the ODS, however: they can also be passed into the BI system (Data Warehouse or Data Marts as appropriate) and to other applications. Because all “outrigger” applications support a common set of data interfaces, they are highly sociable; and because the message bus has a consistent translation layer that allows messages to be merged into the ODS, marshalling data for a real-time “Common view of Truth” becomes feasible. This approach does come with some core caveats, however.

  1. “Tactical” applications need to honour corporate standards for message format and data completeness.  If they do not, you’re back in the chaotic world of option 3.
  2. Business rules are either applied within core applications, or within orchestrations on the message bus – but not both.  This eliminates duplication and inappropriate overloading of business rules.
  3. BI solutions track not only the content of the ODS, but also in-flight messages to ensure up-to-the-minute business data integration.
  4. A centrally managed enterprise data model is in place, which reflects not only the entities and attributes for data in the system and the functional requirements on that data (transformations, de-identifications, business rules, (de-)serialization, security and aggregations), but also the data life-cycle states (initial acquisition, transfers between systems, merging with the ODS/DW environment, de-identification steps, archiving, purging and also wire-transmission states where there may be encryption and/or other privacy requirements).
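The caveats above can be sketched as a single governed routing step on the bus. Everything named here is an assumption made for illustration: the `MESSAGE_STANDARD` and `SENSITIVE_FIELDS` definitions stand in for a real corporate standard, and de-identification is reduced to simply dropping fields.

```python
# Hypothetical corporate standard: the fields each message type must carry.
MESSAGE_STANDARD = {
    "customer.updated": {"customer_id", "name", "email"},
}

# Hypothetical list of fields that must not reach the BI layer in the clear.
SENSITIVE_FIELDS = {"name", "email"}


class NonConformingMessage(ValueError):
    """Raised when a message does not meet the corporate standard."""


def validate(message_type: str, payload: dict) -> None:
    """Enforce message format and completeness (caveat 1)."""
    required = MESSAGE_STANDARD.get(message_type)
    if required is None:
        raise NonConformingMessage(f"unknown message type: {message_type}")
    missing = required - set(payload)
    if missing:
        raise NonConformingMessage(f"{message_type} is missing {sorted(missing)}")


def deidentify(payload: dict) -> dict:
    """Crude de-identification for the sketch: drop the sensitive fields."""
    return {key: value for key, value in payload.items() if key not in SENSITIVE_FIELDS}


def route(message_type: str, payload: dict, ods: list, warehouse: list) -> None:
    """Apply the rules once, on the bus, then fan out (caveats 2 and 3)."""
    validate(message_type, payload)  # rejected messages reach neither target
    ods.append((message_type, dict(payload)))
    warehouse.append((message_type, deidentify(payload)))


if __name__ == "__main__":
    ods, warehouse = [], []
    route("customer.updated",
          {"customer_id": 7, "name": "Ada", "email": "ada@example.com"},
          ods, warehouse)
    try:
        route("customer.updated", {"customer_id": 8}, ods, warehouse)
    except NonConformingMessage as exc:
        print("rejected:", exc)
    print(warehouse)
```

Because validation and de-identification live in one place, a tactical application cannot quietly drift away from the standard, and the BI side sees in-flight messages at the same moment the ODS does.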

While these requirements may seem onerous, they do actually save money if applied rigorously and without excessive bureaucracy.  Centralized data governance is not always something you want a team of auditors to be put in charge of.  The data architects responsible for data governance need to have the trust of the business, and this is something that is earned, not simply granted. 

However, this is what architects do.  We sit in the grey zone between the business and technology stacks.  We help business people identify what technology solutions are worth investing in, and we keep technologists focused on the delivery of positive business outcomes.  We need the trust of both parties, and in the last couple of decades, there have been plenty of situations where we have let both sides down.  Architects need to climb out of the ivory tower and engage directly with technologists for the full life of every system.  Architects also need to be cognizant of business value and ensure that our designs and recommendations yield every cent of ROI promised.

And to me, the most critical architect of the lot is the data architect. Failing to model and govern data and system-wide information behaviour effectively sets everyone else up for failure. If a data architect can’t get the ear of individual application architects, then they set their peers’ projects up for diminished value, if not outright failure. If a data architect can’t get buy-in from a solution or sales architect to ensure that the appropriate governance processes are properly scoped and delivered, then they can likewise take some ownership of the failure. Data architects need to ensure that infrastructure architects understand the workloads and network traffic requirements that must be met in order to deliver a healthy solution. And data architects need to have the ear of enterprise architects to ensure that the importance of doing data properly isn’t lost in translation between the EA and their executive clients. Don’t get me wrong… the other folks still need to get their pieces right, but if a data architect gets it wrong, everyone else suffers.

This is one of the reasons I’ve been lobbying PASS to create a Data Architecture virtual chapter for SQL Server.  I’m starting to get some traction, so watch this space for more updates.


  1. Glad you attended the user group meeting. EntitySpaces is architected the way it is because it's the way we like to work: it's extremely fast, maintenance free, database independent, and extremely flexible. We specifically don't like modeling our code so that it's different from our database. I have never bought into that. EntitySpaces is used in some very complex sites, including Fortune 50 and Fortune 500 companies, by the defense industry, many government sites, hospitals, and in many Compact Framework applications and even under Mono quite often. We've been around since 2006 so we are doing something right ;)

  2. Thanks for the comment. :) I'm not necessarily bagging Entity Spaces per se... just stating that in my opinion the jury's still out on the approach of generating loads of files - it seems to me to be a brittle architecture if someone who doesn't know ES well starts tinkering with the generated files. This is why I prefer the runtime factory approach - people can't ignorantly dig into the innards and break the workings of the framework. When you have the power of generics, lambda expressions and anonymous functions at your disposal in the .NET Framework, I think the opportunities for effective late-bound solutions are far greater than they were when MyGeneration and Entity Spaces first started kicking around.