Data Supply Framework 3.0 ETL Patterns

Martyn Jones / 14 min read.
March 14, 2017

This article is the first in a series that discusses aspects of the use of architectural patterns in the Cambriano Information Supply Framework 3.0.

The term architectural pattern may sound grand, misleading or daunting, but it's really quite a simple concept. It's like writing a function in a programming language to log in to a database, check that the connection is alive and working, and report back the success of the connection request. If that function can be reused, either in the same application development, in the same IT shop or in IT in general (e.g. Java code to connect and test the connection to SQL Server), then it's well on its way to becoming an architectural pattern. Of course, there are much more sophisticated architectural patterns, but generally a pattern is a simplified and generic template for addressing a commonly occurring problem. As with much in architecture, less usually turns out to be more.
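To make that concrete, here is a minimal sketch in Python, using sqlite3 purely as a stand-in database; the connection details are illustrative assumptions rather than anything prescribed by the framework:

```python
import sqlite3

def check_connection(db_path=":memory:"):
    """Reusable pattern: open a connection, verify it is alive, report the result.

    sqlite3 stands in for any database driver here; the same shape applies to
    any client library that exposes connect() and a trivial test query.
    """
    conn = None
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("SELECT 1")  # cheap "is the connection alive?" probe
        return {"connected": True, "detail": "connection OK"}
    except sqlite3.Error as exc:
        return {"connected": False, "detail": str(exc)}
    finally:
        if conn is not None:
            conn.close()

if __name__ == "__main__":
    print(check_connection())
```

Once a function like this is reused across projects, the logic it captures starts to behave exactly like the pattern described above.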

In this article I will be looking at patterns for the process known as ETL (Extract, Transform and Load), which is the typical mechanism used to take data from source systems (which may be systems of record) through transformation (to put the data into a conformed format that real people can understand and use) and finally to load the resulting data into the Enterprise Data Warehouse, from which it is subsequently loaded into and used in department-oriented, discipline-focused or special-project Data Marts.

The Cambriano Information Supply Framework 3.0 is the architectural blueprint for the integration of Big Data, Analytics (including Textual Analytics) and Structured Intellectual Capital (Knowledge Management) with Bill Inmon's definitive Enterprise Data Warehouse architecture and process.

Links to more information about the Data Supply Framework (including 4th generation Data Warehousing) are listed at the end of this article.

In subsequent articles I will be looking at architectural patterns across the DSF 4.0 landscape, including patterns for Big Data and Analytics.

The following diagram illustrates the overall drivers for the Information Supply Framework.

Fig. 1 DW 3.0 Information Supply Framework

From the right, the consumers and prospective consumers of data and information create the demand for information and data.

The middle process of data processing, enrichment and information creation strives to meet the business demands and also provokes business demands for data and information.

From the left, the data sources provide data to meet real and secondary demands.

As stated previously, this article will focus on ETL architectural patterns as they apply to conventional landscape architecture patterns in Enterprise Data Warehousing (EDW / DW, Data Marts, Operational Data Store) and will not include patterns such as Data Governors or the Analytic Data Store (which I shall cover in later articles).

For those unfamiliar with the term ETL, it is the most common way in IT of moving data from a source of data (such as an operational database containing customer information) to a target database (such as one used for reporting). It's a logistics method for data (it copies data from where it is to where it is needed) and it's a selection, manufacturing and packaging method (for selecting the data that is needed, enriching the data as needed and delivering the data in a form that can be used).
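For readers who prefer to see the shape of it in code, here is a deliberately small sketch of that extract-transform-load flow in Python; the file, table and column names are assumptions made purely for illustration:

```python
import csv
import sqlite3

def extract(source_csv):
    """Extract: read rows from a source file (standing in for an operational system)."""
    with open(source_csv, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: put the data into a conformed, understandable format."""
    return [
        {"customer_id": int(r["id"]), "full_name": r["name"].strip().title()}
        for r in rows
    ]

def load(rows, warehouse_db):
    """Load: write the conformed rows into the (illustrative) warehouse table."""
    conn = sqlite3.connect(warehouse_db)
    conn.execute("CREATE TABLE IF NOT EXISTS dim_customer (customer_id INTEGER, full_name TEXT)")
    conn.executemany(
        "INSERT INTO dim_customer (customer_id, full_name) VALUES (:customer_id, :full_name)",
        rows,
    )
    conn.commit()
    conn.close()

# Typical invocation: load(transform(extract("customers.csv")), "edw.db")
```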

Anyway, the ETL I am focusing on in this article is highlighted (by the green boxes) in the following high-level conceptual view.

Fig. 2 DW 3.0 Information Supply Framework

Analysis Patterns for ETL

I will not dwell too long on Analysis Patterns for ETL as this requires separate treatment from the discussion of ETL architectural patterns.

Needless to say, the usage of ETL Analysis Patterns should be a constant in Data Warehousing, regardless of whether the patterns used are client specific, industry specific or generic.

Typical analysis patterns cover:

  • Structured requirements gathering and knowledge elicitation
  • Data steward/platform owner exploration workshops and data provision negotiations
  • Comprehensive source to target mapping, including the collection, enrichment and communication of business, management and technical metadata

I will be dealing with these and other pattern areas in subsequent articles.

Why do we use ETL patterns?

Creating reusable code structures and logic is important in most IT development, and it is just as important in ETL processes and structures. The modularisation of the ETL process helps avoid unnecessarily rewriting the same tested code, and reduces the total effort required to build and maintain ETL architectures.

ETL patterns can also be used by developers who have little or no idea of what they are doing, but who will know enough to be able to drop in a component knowing what that component's inputs and outputs should notionally be. This is probably why the approach is so popular with offshored project development.
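The "drop-in component" idea can be illustrated with a tiny sketch: each component only has to honour an agreed input and output shape (rows in, rows out). The component names and rules below are hypothetical:

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Step = Callable[[Iterable[Row]], List[Row]]

def run_pipeline(rows: Iterable[Row], steps: List[Step]) -> List[Row]:
    """Run drop-in steps in order; each step takes rows in and returns rows out."""
    for step in steps:
        rows = step(rows)
    return list(rows)

# Example drop-in components (hypothetical rules):
def drop_empty_names(rows: Iterable[Row]) -> List[Row]:
    return [r for r in rows if r.get("name")]

def upper_case_country(rows: Iterable[Row]) -> List[Row]:
    return [{**r, "country": str(r.get("country", "")).upper()} for r in rows]

if __name__ == "__main__":
    data = [{"name": "Ada", "country": "uk"}, {"name": "", "country": "es"}]
    print(run_pipeline(data, [drop_empty_names, upper_case_country]))
```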

Architectural Patterns in ETL

Now we have reached the main focus of this article. So, the ETL architectural patterns I wish to address briefly are:

  1. Data extraction patterns including early change data capture patterns
  2. Data mapping patterns
  3. Data cleansing patterns
  4. Data transformation patterns
  5. Data loading patterns including change data capture patterns
  6. Post-load processing patterns including late change data capture patterns
  7. Metadata management patterns: business, technical and management metadata
  8. Logging and auditing patterns
  9. Error handling
  10. Process and data integrity

Given the nature of this type of article, the list is focused on major parts. Of course, there will be some areas of overlap in the process areas mentioned above, and some of these overlaps will be identified and explained shortly.

As hinted at previously, this article does not include a detailed discussion of streamed, queued or governed data mechanisms. These topics will be addressed in later articles in this series.

So, without more ado let's get the ball rolling.

Simple data extraction patterns: Source / Intermediate / Target EDW

Simple data extraction patterns are required where data is not sourced directly from the system of record, but is instead staged (in one form or another) in a staging area that typically sits between the source systems and the target EDW platform, making the flow a three-step process.

The following patterns are frequently used in staging:

  • Full or partial export from operational databases to intermediate file store staging area as (ASCII) files
  • Full or partial export of operational database reference data and operational transaction logs to intermediate file store staging area as files
  • Full or partial export replication of operational databases to an equivalent database management system platform for staging
  • Full or partial export of operational database reference data, transaction data and/or operational transaction logs to an equivalent database management system platform for staging
  • Database specific export to export format file copied to a staging area for subsequent unpacking and processing
  • Patterns for taking data off queues or streams and storing it in text files, or loading it directly into an intermediate database management system such as Greenplum, Postgres, Oracle, SQL Server, EXASol, Vertica or Teradata.

There are also patterns for using Distributed File Systems (such as MapR or Hortonworks) for intermediate ETL processing, whereby the commodity platform is used as a staging pattern.

Most of these patterns are relatively simple and straightforward, and don't actually require an immense amount of work, even if they are reinvented as patterns specific to the IT shop of a particular business. They can come in handy when there is little to no experience in how to go about doing these things.
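As a hedged illustration of the first staging pattern listed above (full or partial export to an intermediate file store), here is a small Python sketch; the watermark column and table names are assumptions:

```python
import csv
import sqlite3
from pathlib import Path

def export_to_staging(db_path, table, staging_dir, updated_since=None):
    """Export a source table (fully, or partially via a watermark) to a staging CSV.

    The table name and the 'updated_at' watermark column are illustrative
    assumptions; real staging patterns would parameterise these per source.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    if updated_since is None:
        cursor = conn.execute(f"SELECT * FROM {table}")                      # full export
    else:
        cursor = conn.execute(
            f"SELECT * FROM {table} WHERE updated_at > ?", (updated_since,)  # partial export
        )
    rows = cursor.fetchall()
    out_file = Path(staging_dir) / f"{table}.csv"
    with open(out_file, "w", newline="") as f:
        if rows:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(dict(r) for r in rows)
    conn.close()
    return out_file
```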

Data mapping patterns: EDW

There are a number of architectural artifacts available that provide patterns for the process of mapping data through from source to target.

These patterns can take a number of forms:

  • Spreadsheet layouts for source-to-target mapping: these can be used in conjunction (or not) with code to extract the metadata and generate some of the ETL processes. Teradata consultants in the Nordic countries have used ETL generation processes which essentially take source-to-target mappings defined in MS Excel and convert that data into simplified ETL processes (a simplified sketch of this idea follows after this list).
  • On a more sophisticated level, IBM, for example, provides a software utility (the Graphical Data Mapping Editor in IBM Integration Bus V9.0) for use in the analysis phase, which allows for the creation and management of partially directed data mapping patterns. This approach is not specific to EDW.
  • Another tool sub-class, also provided by IBM and others, is straight EDW-oriented mapping tools. IBM's mapping tool is named InfoSphere Data Architect (IDA). These tools provide an environment in which to easily create and modify mappings, a good level of version control, easy reporting on mappings and mapping progress, and the ability to export mapping data to other formats.
  • All major ETL tools provide a full set of mapping capabilities, and this includes InfoSphere (as mentioned previously), SAP BusinessObjects (Data Integrator / Data Services), Talend, Oracle ODI, Syncsort, Tibco, Pitney Bowes, Open Text, Sybase ETL, Informatica, Ab Initio, Altova, Clover ETL, Pentaho, SAS Data Integration Studio and Microsoft SSIS. ETL architectural patterns provided with these tools also cover most of the other areas mentioned in this article (from extraction, through transformation, to loading the EDW and Data Marts).
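By way of illustration of the spreadsheet-driven approach mentioned in the first bullet, here is a much-simplified sketch; the mapping file layout, column names and transform names are assumptions, not any vendor's format:

```python
import csv

def load_mapping(mapping_csv):
    """Read a source-to-target mapping file with columns:
    source_column, target_column, transform (one of: copy, upper, int)."""
    with open(mapping_csv, newline="") as f:
        return list(csv.DictReader(f))

# Illustrative transform registry referenced by the mapping metadata.
TRANSFORMS = {
    "copy": lambda v: v,
    "upper": lambda v: str(v).upper(),
    "int": lambda v: int(v),
}

def apply_mapping(row, mapping):
    """Generate a target row from a source row, driven purely by the mapping metadata."""
    return {
        m["target_column"]: TRANSFORMS[m["transform"]](row[m["source_column"]])
        for m in mapping
    }

# Example:
# apply_mapping({"cust_no": "42", "cty": "uk"},
#               [{"source_column": "cust_no", "target_column": "customer_id", "transform": "int"},
#                {"source_column": "cty", "target_column": "country", "transform": "upper"}])
```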

Some versions of mapping patterns, templates and software are available for free (as in community releases or express versions) or versions made available on a fixed-term or use-oriented trial basis.

Another facet that must be taken into account, moving forward, is the rise of Textual Analysis. There are Textual ETL tools on the market, but as yet there has not been much by way of Textual Analysis and Textual ETL architectural patterns.

Data cleansing patterns

IBM and Pitney Bowes have spent time and effort on building up knowledge and experience of data cleansing patterns, not only in Data Warehousing but also in the broader area of data integration.

IBM described data cleansing patterns in these terms:

The data cleansing pattern specifies rules for data standardization, matching, and survivorship that enforce consistency and quality of the data it is applied against. The data cleansing pattern validates the data (whether persistent or transient) against the cleansing rules and then applies changes to enforce consistency and quality. This pattern specification helps you, as a data or application architect, to make informed architectural decisions and to improve decision guidelines. Source: https://www.ibm.com/developerworks/library/ws-soa-infoserv3/

Typically the recycling of data cleansing patterns is achieved through the delineation of cleansing rules that can be applied at the field, record and data object level through a service, a stream or a batch process, the bulk of which should ideally occur before, and not during, the transformation steps. The second area of pattern usage is the reuse of these rules in an OLTP/operational environment for the execution of the cleansing rules.
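A minimal sketch of what such field-level cleansing rules might look like when expressed as reusable code (the fields, standardisation functions and validation rules are assumptions):

```python
import re

# Field-level cleansing rules: each entry names the field, a standardisation
# function and a validation predicate (all illustrative assumptions).
CLEANSING_RULES = {
    "postcode": {
        "standardise": lambda v: str(v).upper().replace(" ", ""),
        "validate": lambda v: bool(re.fullmatch(r"[A-Z0-9]{4,8}", v)),
    },
    "email": {
        "standardise": lambda v: str(v).strip().lower(),
        "validate": lambda v: "@" in v and "." in v.split("@")[-1],
    },
}

def cleanse_record(record):
    """Apply the rules to one record; return the cleansed record plus any rule failures."""
    cleansed, errors = dict(record), []
    for field, rule in CLEANSING_RULES.items():
        if field in cleansed:
            value = rule["standardise"](cleansed[field])
            cleansed[field] = value
            if not rule["validate"](value):
                errors.append(f"{field} failed validation: {value!r}")
    return cleansed, errors

if __name__ == "__main__":
    print(cleanse_record({"postcode": " sw1a 1aa ", "email": "Someone@Example.COM"}))
```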

Data transformation patterns: EDW

What I want to focus on briefly here are transformation pattern types and transformation rule patterns.

There are many formalized architectural patterns for theoretical source-to-target transformation; however, they are generally not available as plug, modify and play patterns, and they tend to be untested approaches to unstated or exotic requirements.

As mentioned previously, some of the major ETL (and Data Integration) software vendors do provide generic sets of patterns for data transformation.
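As a rough illustration of transformation rule patterns, here is a sketch in which each rule is a small, reusable function that can be registered and combined; the rule names and example data are assumptions:

```python
# A registry of generic transformation rule patterns; each returns a function
# that rewrites one row. The rule names and example usage are assumptions.
def derive(target, fn):
    """Pattern: derive a new column from existing ones."""
    return lambda row: {**row, target: fn(row)}

def lookup(column, table, default=None):
    """Pattern: replace a code with its description via a reference lookup."""
    return lambda row: {**row, column: table.get(row.get(column), default)}

def default_value(column, value):
    """Pattern: fill a missing value with a default."""
    return lambda row: {**row, column: row.get(column) or value}

def apply_rules(row, rules):
    for rule in rules:
        row = rule(row)
    return row

if __name__ == "__main__":
    rules = [
        default_value("currency", "EUR"),
        lookup("country", {"ES": "Spain", "UK": "United Kingdom"}, default="Unknown"),
        derive("net_amount", lambda r: round(r["gross_amount"] / 1.21, 2)),
    ]
    print(apply_rules({"country": "ES", "gross_amount": 121.0, "currency": None}, rules))
```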

Data loading patterns: EDW

EDW data loading patterns cover a wealth of pattern areas, including those related to full and partial loads; stereotyping and partitioning; parallelism; replication; change data capture/recapture; data aging and archiving; historicity; and categorization and classification. They will also encompass patterns for connectivity, security, logging, fail-safe or fail-soft recovery, management metadata collection and auditing.



Please note that I am not addressing the creation of hub and spoke dimensional models in this segment. For clarity and coherence, the loading patterns for the EDW will not be of the same class and nature as the data loading patterns for Data Marts and Dimensional Models.

Typical patterns in EDW loading involve the use of fast, short and simple paths to data ingestion. These patterns will be determined quite significantly by the nature of the underlying infrastructure technology: a superfast and reliable load pattern for Oracle 12c is not the same as one for Teradata or SQL Server, for example, nor for the platforms they are running on.

Patterns may also be provided to load data into data models created using 3NF, ER modeling and Data Vault modeling. Typically, dimensional modeling will not be a part of the EDW and will therefore not have associated patterns.
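To give a feel for one of the simpler loading patterns (an incremental, change-data-capture style load driven by a high-watermark), here is a hedged Python sketch using sqlite3 as a stand-in for both source and target; all table and column names are assumptions:

```python
import sqlite3

def incremental_load(source_db, target_db, table="transactions"):
    """Copy only rows newer than the last recorded high-watermark into the EDW table.

    The table names, the 'etl_watermark' structure and the use of sqlite3 are
    illustrative assumptions, not a reference implementation.
    """
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    tgt.execute("CREATE TABLE IF NOT EXISTS etl_watermark (table_name TEXT PRIMARY KEY, last_id INTEGER)")
    tgt.execute(f"CREATE TABLE IF NOT EXISTS {table} (id INTEGER PRIMARY KEY, amount REAL)")

    row = tgt.execute("SELECT last_id FROM etl_watermark WHERE table_name = ?", (table,)).fetchone()
    last_id = row[0] if row else 0

    new_rows = src.execute(f"SELECT id, amount FROM {table} WHERE id > ?", (last_id,)).fetchall()
    tgt.executemany(f"INSERT OR REPLACE INTO {table} (id, amount) VALUES (?, ?)", new_rows)
    if new_rows:
        tgt.execute(
            "INSERT OR REPLACE INTO etl_watermark (table_name, last_id) VALUES (?, ?)",
            (table, max(r[0] for r in new_rows)),
        )
    tgt.commit()
    src.close()
    tgt.close()
    return len(new_rows)  # number of changed rows picked up this run
```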

Data transformation patterns: Data Marts

These are patterns that are used to transform data located in the EDW into data that is used in Data Marts.

Generally speaking, there should be a limited degree of transformation of data carried out here. What transformation is done should generally be constrained to the aggregation of data and the creation of additional dimensional facets. Therefore, there is scope for the use of a small number of highly reusable patterns for dimensional transformation, but not for data transformation itself.

Also, no data cleaning should be contemplated at this stage. Data quality needs to be fixed before the data is stored in the EDW (although data may be staged on the same platform and in the same DBMS).
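A small sketch of the kind of constrained, aggregation-only transformation described above; the grain (month and product) and column names are assumptions:

```python
from collections import defaultdict

def aggregate_sales(edw_rows):
    """Aggregate EDW-level transactions to a monthly, per-product Data Mart grain.
    No cleansing happens here; the data is assumed to be clean already in the EDW."""
    totals = defaultdict(lambda: {"quantity": 0, "revenue": 0.0})
    for r in edw_rows:
        key = (r["month"], r["product_id"])          # the aggregation grain
        totals[key]["quantity"] += r["quantity"]
        totals[key]["revenue"] += r["quantity"] * r["unit_price"]
    return [
        {"month": m, "product_id": p, **vals} for (m, p), vals in sorted(totals.items())
    ]

if __name__ == "__main__":
    rows = [
        {"month": "2017-03", "product_id": 1, "quantity": 2, "unit_price": 9.99},
        {"month": "2017-03", "product_id": 1, "quantity": 1, "unit_price": 9.99},
    ]
    print(aggregate_sales(rows))
```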

Data loading patterns: Data Marts

There will be initial and incremental loads of Data Mart data.

Alternatively, one may choose to completely refresh the Data Mart data as and when required.

Most Data Marts will be built using the database structure commonly known as the Dimensional Model, and made famous by Ralph Kimball, the father of complex business SQL and a leading dimensional model evangelist.

Ralph has come up with a whole raft of ways that data can be ingested into a Data Mart, and all of these types can be captured and communicated and reused using ETL patterns.

In brief, here is the list of slowly changing dimension types as defined by Kimball:

Fig. 3 Slowly Changing Dimension Types
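To show how one of these types turns into a reusable loading pattern, here is a hedged sketch of a Type 2 (row-versioning) change in plain Python; the dimension structure and column names are assumptions rather than Kimball's own notation:

```python
from datetime import date

def apply_scd_type2(dimension, incoming, key="customer_id", tracked=("address",)):
    """Slowly Changing Dimension Type 2: expire the current row and insert a new
    version when a tracked attribute changes. Column names are illustrative."""
    today = date.today().isoformat()
    current = {r[key]: r for r in dimension if r["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is None:
            dimension.append({**new, "valid_from": today, "valid_to": None, "is_current": True})
        elif any(old[c] != new[c] for c in tracked):
            old["valid_to"], old["is_current"] = today, False          # expire old version
            dimension.append({**new, "valid_from": today, "valid_to": None, "is_current": True})
    return dimension

if __name__ == "__main__":
    dim = [{"customer_id": 1, "address": "Old St", "valid_from": "2016-01-01",
            "valid_to": None, "is_current": True}]
    print(apply_scd_type2(dim, [{"customer_id": 1, "address": "New Rd"}]))
```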

Post-load processing patterns: EDW

Not all transformation will be done before data is loaded. Sometimes it is more expedient and performant to load data and then transform it. This will depend on needs.

To some extent the ELT process lends itself to patterns, but be very wary of trying to over-pattern everything. If a pattern will only rarely be applicable, the cost of maintaining it might actually outweigh the benefits.
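An illustrative ELT sketch, in which raw data is landed first and the transformation then runs as SQL inside the target platform (sqlite3 and the staging and fact table names are assumptions):

```python
import sqlite3

def elt_load_then_transform(db_path, raw_rows):
    """ELT pattern: land the data untouched, then transform it in-database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS stg_orders (order_id INTEGER, amount TEXT)")
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", raw_rows)   # 1. load raw

    # 2. transform after the load, using the database engine itself
    conn.execute("CREATE TABLE IF NOT EXISTS fct_orders (order_id INTEGER, amount REAL)")
    conn.execute(
        """
        INSERT INTO fct_orders (order_id, amount)
        SELECT order_id, CAST(amount AS REAL)
        FROM stg_orders
        WHERE CAST(amount AS REAL) > 0
        """
    )
    conn.commit()
    conn.close()

# Example: elt_load_then_transform("edw.db", [(1, "12.50"), (2, "0.00")])
```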

An interesting aspect of patterns has emerged since the PaaS trend has picked up. For example, there are data loading guidelines for SQL Data Warehouse. Several common loading options are described, such as SSIS, BCP, Azure Data Factory (ADF) and SQLBulkCopy, as well as PolyBase technology, which is apparently the preferred and fastest loading method for ingesting data into SQL Data Warehouse.

ETL and database technologies provide many options to load data, as discussed in this article. Each tool and method has its own advantages and disadvantages. In Microsoft's case, it's easy to lift, shift and drop your existing ETL packages, scripts and other Control-node client gated methods to mitigate migration effort. However, if you require higher speeds for data ingestion, consider rewriting your processes to take advantage of PolyBase with its high-throughput, highly scalable loading methodology.

Other vendors have similar approaches, with their own unique selling propositions.

Metadata management patterns

Unsurprisingly, one of the ways of automating ETL pattern development has been via the metadata route. For example, Microsoft SSIS can be used in conjunction with a rich set of metadata to generate reusable SSIS packages for data extraction, transformation and loading. SSIS control flows are generated directly from the metadata and imported into SSIS.
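The following is a toy illustration of the metadata-driven idea in Python; it is not the SSIS mechanism itself, and the metadata entries are assumptions:

```python
# Metadata describing the desired ETL flows; in a real shop this would live in
# a repository or database rather than in code. All entries are assumptions.
ETL_METADATA = [
    {"source_table": "customers", "target_table": "dim_customer", "load_type": "full"},
    {"source_table": "orders", "target_table": "fct_orders", "load_type": "incremental"},
]

def generate_tasks(metadata):
    """Generate callable ETL tasks directly from metadata, rather than hand-coding each one."""
    def make_task(entry):
        def task():
            return f"{entry['load_type']} load: {entry['source_table']} -> {entry['target_table']}"
        return task
    return [make_task(m) for m in metadata]

if __name__ == "__main__":
    for task in generate_tasks(ETL_METADATA):
        print(task())   # in reality each task would call extract/transform/load code
```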

In the Microsoft SSIS example (published by Microsoft), the run-time environment of the platform is based on a controller/worker architecture, and Microsoft's documentation provides more detail on distributed execution of SSIS packages and unified logging using controllers and workers.

The dearth of metadata management patterns in ETL compounds the general dearth of ETL patterns. This may have something to do with the lack of widespread genericity in Data Warehousing ETL processing, which, unlike much of solutions architecture at the application level, goes beyond a clear set of enterprise application building blocks associated with CRUD activities, transaction data, reference data and master data.

The fact of the matter is that although there has been much research into the automation of Data Warehousing, and especially ETL, the coverage provided by the efforts to date does not meet the minimum needs of many organisations allured by the promise of such technologies. Yes, they do have their niche, but they are certainly not receiving widespread acclaim, and that's for a reason. ETL automation that is blind to applicability, usability and cost can also lead to foreseen risks and unforeseen costs that wipe out any financial benefit up front.

Logging and auditing patterns

Use logging and auditing patterns throughout your ETL process landscape.

Ensure that all generic aspects of exception handling have corresponding exception handling patterns.

Ensure that generic patterns are available for generic aspects of ETL process auditing.

Remember, there may be a legal requirement to show that the ETL process has been correctly audited each and every time.

Better still, buy tools or add-ons that provide comprehensive logging and auditing.

There is no point in reinventing the wheel in exactly the same way, and there is no point in reusing patterns for the sake of development dogma.

Also, before I forget: ensure all aspects of any developed or provided ETL functionality are well documented, especially if it is reusable pattern functionality and structures.

Last, but not least, don't forget to use version control on all development aspects of ETL.
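Pulling those points together, here is a minimal sketch of a reusable logging and auditing wrapper that could be applied to every ETL step; the audit fields and step names are assumptions:

```python
import functools
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
AUDIT_TRAIL = []   # in practice this would be an audit table, not an in-memory list

def audited(step_name):
    """Decorator pattern: wrap any ETL step with consistent logging and auditing."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = datetime.now(timezone.utc)
            try:
                result = fn(*args, **kwargs)
                status = "success"
                return result
            except Exception:
                status = "failure"
                raise
            finally:
                record = {"step": step_name, "started": started.isoformat(),
                          "finished": datetime.now(timezone.utc).isoformat(), "status": status}
                AUDIT_TRAIL.append(record)
                logging.info("ETL audit: %s", record)
        return wrapper
    return decorator

@audited("load_dim_customer")
def load_dim_customer():
    return 42   # placeholder for the real load

if __name__ == "__main__":
    load_dim_customer()
    print(AUDIT_TRAIL)
```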

Testing patterns

I want to deal with the subject of Data Warehousing and ETL testing (including test patterns) separately. But for completeness it also appears here.

Needless to say, when it comes to table-to-table and field-to-field testing, this is where reusable dynamic patterns should come into their own.
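For example, a reusable table-to-table reconciliation test might compare row counts and a simple column checksum between source and target; the table and column names in this sketch are assumptions:

```python
import sqlite3

def reconcile(source_db, target_db, table, numeric_column):
    """Reusable test pattern: compare row counts and a column checksum between
    a source and a target table. Table and column names are assumptions."""
    def profile(db_path):
        conn = sqlite3.connect(db_path)
        count, total = conn.execute(
            f"SELECT COUNT(*), COALESCE(SUM({numeric_column}), 0) FROM {table}"
        ).fetchone()
        conn.close()
        return count, round(total, 2)

    src, tgt = profile(source_db), profile(target_db)
    return {"table": table, "source": src, "target": tgt, "match": src == tgt}

# Example: reconcile("staging.db", "edw.db", "fct_orders", "amount")
```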

At the moment there is not a wealth of available ETL test patterns in the public domain, as testing has always been a bit of an afterthought in Data Warehousing. However, there is a growing realization of the importance of testing, especially in cases where regulatory reporting depends on Data Warehousing.

It is hoped that the ETL tools themselves will provide test pattern functionality built in, which will remove the need for alternative means to design, build, test and document ETL test patterns.

Let's see if the ETL vendors step up to the plate.

Automation patterns

ETL and data warehouse scheduling and automation patterns will be addressed in a separate article on Data Warehousing Automation and Scheduling.

That's all, folks

As with many of these aspects of architectural patterns, there isn't a wealth of information and usable patterns in the public domain, and this is further compounded if the tool you happen to be using is rather esoteric.

As this is the first of a series of articles, I would like to make a personal observation regarding patterns. Patterns are not new; they are another name for something that professionals in IT and other industries have been creating and applying for decades. Even the COBOL shop in the company where I had my first IT-related job used patterns, which they called copy-books. Patterns are neither new nor revolutionary; they are just another way to create, reuse and extend functionality, process and structure.

As an architect I am familiar with building blocks, components, libraries, blueprints, reference architectures, industry models, nonspecific artefacts, genericity and polymorphism, and a long list of etceteras. So the use of the term pattern does not actually introduce anything new, which in the big scheme of things does not matter too much.

However, and to reiterate: as a Data Architect I think that documentation of design patterns, aligned to the definition and use of architectural building blocks, should be used to direct and guide the deployment of solution building blocks and reusable solution functionality and structures. Call them patterns, models or library functions.

Bottom line? Reusable architectural patterns are useful, but don't expect to be able to do more than 20% of your ETL with generic patterns, unless your business is super-generic or your ETL needs are minimal.

Until the next time. Have fun!

Many thanks for reading.

If you have any questions regarding this article then please feel free to connect with me here on LinkedIn or GoodStrat.com, or leave a comment below.

Categories: Technical
Tags: Big Data, ETL, framework, Metadata, patterns

About Martyn Jones

Martyn's range of knowledge, skills and experience span executive management, organisational strategy, strategic business performance and information management, leadership, business analysis, business and data architectures, data management, and executive and team coaching.

Martyn has worked with and advised many of the world's best-known organisations including Adidas, Banco Santander, Bank of China, BBVA, Boston Consulting Group, British Telecom, La Caixa, Central Statistical Office (UK), Central Statistical Office of Poland, Citco, Citigroup, Credit Suisse, E.On, Eroski, European Union, Fnac, France Telecom, Hewlett Packard, Iberdrola, IBM, Iberia, Infineon, Türkiye, Metropolitan Police, Movistar, NCR, National Health Service (UK), Office of the Governor - State of California, Oracle, The Home Office (UK), Rolls-Royce Marine Power Operations, the Royal Navy, Shell, Swiss Life, TSB, UBS, Unisys, the United Nations and Xerox, among many others.

He currently focuses on helping clients to:

  • Create relevant, understandable and actionable information
  • Plan, manage, design, develop and deliver information supply frameworks for the timely, appropriate and adequate supply of information
  • Design, develop and deliver beneficial, tangible and usable strategic performance and information frameworks
  • Design, develop and deliver relevant and coherent performance models, indicators and metrics
  • Plan, manage, design, develop and deliver information and data analytic strategies
  • Design, develop and deliver management informational insight and dynamic feedback solutions
  • Coach teams in measuring and managing performance
  • Align people, competencies, processes and practices with strategy
  • Prepare clients for the next big thing in Information Management and Analytics
  • Help IT suppliers to better align with the needs and nature of clients and prospects
  • Help clients capitalise on tangible benefits derived from advanced information architectures and management
