Dimensional modeling

Tracking historical changes within a dimension is a common task in data warehousing and is well covered by Ralph Kimball’s slowly changing dimension (SCD) methods. In short, these methods assume that the source system (for example the ERP system) doesn’t keep historical versions of its data records, so changes need to be detected at the time the data is loaded into the warehouse. To keep historical values, versions of the master data records are created to record each state of the original record together with a valid from/to timestamp, so that fact data can be joined to the corresponding dimension data. However, joining fact data to its dimensions directly on the valid from/to dates is usually not a good idea, because this results in range lookups (ETL) or date range (between) joins (in SQL or ELT). The surrogate key concept offers a good solution here by assigning a unique key (the surrogate key) to each version of a record. This key can then be used for a direct inner join from the fact table to its dimensions. The approach moves the time-consuming process of resolving date ranges from query time to data loading time, so it has to be performed only once, and query performance benefits from the simplified link structure between the tables.

However, there may be cases where you find valid from/to dates in the original source system. In this case, the historical values are provided by the source system and usually it’s not necessary for the data warehouse to track the changes. While this sounds much simpler than the case with missing validity dates, it’s usually a challenging situation, especially when past records (and their valid from/to dates) may be modified. For example, a given date range could be split or merged, or the from and to dates may shift. In either case, the surrogate keys of some fact rows would point to the “wrong” dimension record afterwards. So, for these cases you will need to periodically reload parts of your data warehouse (for example the last three months) or, in some rare cases, track the changes and adjust the surrogate keys of the fact tables. I’m saying rare cases because update operations on fact tables that are tuned for high-volume bulk loads and bulk queries are usually not a good idea, so you may want to implement a partition-wise recreation of the fact table (partition switch operations), which adds some complexity to the overall workload management.

However, after this intro, my post today is about a situation where you have several linked tables in the source system, all with valid from/to dates. You may find this situation, for example, in SAP’s human resources tables, where the properties of an employee are stored in so-called infotypes which are independently versioned by valid from/to date ranges. In this post, I’m using a much simpler scenario with the following four tables:

Employee

Organizational Unit (OrgUnit)

Location

Company Car (Car)

The tables reflect a very simple human resources model of four tables: a base table Employee and three detail tables, all joined by the EmployeeNo field. Each table may contain multiple versions of data, and therefore each table has valid from/to fields to distinguish the versions. In my example I’m using an inclusive ValidFrom and an exclusive ValidTo. If you take a look at the first two rows of the OrgUnit table, for example, this means that employee 100 was in the organizational unit “Delivery” from January 1, 2000 until December 31, 2013 and then, starting with January 1, 2014, in “PreSales”.
For each of the four tables, EmployeeNo together with ValidFrom forms a primary key.

One potential problem with such data is that, since the valid from/to dates are delivered by the source system, we need to make sure that these date ranges do not overlap. There might be scenarios where you need to deal with overlapping date ranges (an employee may have none, one or many phone numbers at a given point in time, for example a cell phone and a land line). If you need to model such cases, many-to-many relations between fact and dimensional data may be a solution, or you could move the information from rows to columns of the new dimension table. But for this example I will keep it simple, so we don’t expect overlapping data in our source tables.

However, it’s always a good idea to check incoming data for consistency. The following query for example checks if there are overlapping date ranges in the Employee table by using window functions to retrieve the previous and next date boundaries:

select * from (
select
    EmployeeNo
    , [ValidFrom]
    , [ValidTo]
    , lag([ValidTo],1) over (partition by [EmployeeNo] order by [ValidFrom]) PrevValidTo
    , lead([ValidFrom],1) over (partition by [EmployeeNo] order by [ValidFrom]) NextValidFrom
from Employee
) CheckDateRange
where (PrevValidTo is not null and PrevValidTo>ValidFrom) or (NextValidFrom is not null and NextValidFrom<ValidTo)

Please note that this query does not check for gaps but only for overlapping date ranges in a table. If you would like to detect gaps too, you’ll need to change the > and < in the where condition to <>, i.e.

…where (PrevValidTo is not null and PrevValidTo<>ValidFrom) or (NextValidFrom is not null and NextValidFrom<>ValidTo)

Running this check on all four tables from above shows that the data is consistent (the query returns no faulty rows).
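The same check can be sketched outside the database. The following Python snippet is only an illustration and not part of the original solution: the function name find_overlaps and the sample rows are made up for this example. It assumes an inclusive ValidFrom and an exclusive ValidTo, as described above, and compares each version with its successor, which is what the lag/lead window functions do in the query.

```python
from datetime import date
from itertools import groupby

def find_overlaps(rows):
    """Return pairs of consecutive versions whose date ranges overlap.

    rows: iterable of (employee_no, valid_from, valid_to) tuples,
    with valid_from inclusive and valid_to exclusive.
    """
    problems = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, versions in groupby(ordered, key=lambda r: r[0]):
        versions = list(versions)
        # Compare each version with its successor (the lag/lead idea).
        for prev, nxt in zip(versions, versions[1:]):
            if prev[2] > nxt[1]:  # previous ValidTo lies after next ValidFrom
                problems.append((prev, nxt))
    return problems

# Hypothetical sample data: employee 100 is consistent, employee 200 overlaps.
sample = [
    (100, date(2000, 1, 1), date(2014, 1, 1)),
    (100, date(2014, 1, 1), date(9999, 12, 31)),
    (200, date(2005, 1, 1), date(2010, 6, 1)),
    (200, date(2010, 1, 1), date(9999, 12, 31)),
]
overlaps = find_overlaps(sample)
```

Replacing the > comparison with != would report gaps as well, in the same way as the <> variant of the query.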

Next, we can start to combine all four tables into a single dimension table. Let’s first show the final result:

The information of the four tables is now combined into a single table. Whenever an attribute changes, this is reflected by the valid from/to date range. So, for example, the first change for employee 100 was the company car on June 1, 2008.

So, how do we get there? At first, as the resulting valid from/to dates need to reflect all date ranges from all of the four tables, I start by collecting all of those dates:

with
ValidDates as
(
select EmployeeNo, ValidFrom as Date from Employee
union
select EmployeeNo, ValidTo from Employee
union
select EmployeeNo, ValidFrom from OrgUnit
union
select EmployeeNo, ValidTo from OrgUnit
union
select EmployeeNo, ValidFrom from Location
union
select EmployeeNo, ValidTo from Location
union
select EmployeeNo, ValidFrom from Car
union
select EmployeeNo, ValidTo from Car
)

This gives a list of all valid from/to dates by employee from all four tables, with duplicates removed (since I used a union, not a union all). This is what the result looks like:

Next, I’m using this information to build the new valid from/to date ranges by using a window function to perform a lookup for the next date:

with
ValidDates as …
,
ValidDateRanges1 as
(
select EmployeeNo, Date as ValidFrom, lead(Date,1) over (partition by EmployeeNo order by Date) ValidTo
from ValidDates
)
,
ValidDateRanges as
(
select EmployeeNo, ValidFrom, ValidTo from ValidDateRanges1
where ValidTo is not null
)

Please note that we already have the 10 resulting rows from the final result (see above) with the correct date ranges, but without the information from our four tables yet. So, now we can join the four tables with the date range table, making sure to include the proper date range in the join condition. Here’s the resulting query:

with
ValidDates as …
, ValidDateRanges1 as …
, ValidDateRanges as …

select
      E.EmployeeNo
    , E.Name
    , E.EmployeeID
    , isnull(OU.OrgUnit,'unknown') OrgUnit
    , isnull(L.Building,'unknown') Building
    , isnull(L.Room,'unknown') Room
    , isnull(C.CompanyCarId,'no company car') CompanyCarId
    , D.ValidFrom, D.ValidTo
from Employee E
inner join ValidDateRanges D
on E.EmployeeNo=D.EmployeeNo and E.ValidTo>D.ValidFrom and E.ValidFrom<D.ValidTo
left join OrgUnit OU
on OU.EmployeeNo=D.EmployeeNo and OU.ValidTo>D.ValidFrom and OU.ValidFrom<D.ValidTo
left join Location L
on L.EmployeeNo=D.EmployeeNo and L.ValidTo>D.ValidFrom and L.ValidFrom<D.ValidTo
left join Car C
on C.EmployeeNo=D.EmployeeNo and C.ValidTo>D.ValidFrom and C.ValidFrom<D.ValidTo

Since we made sure that no date ranges overlap within a single table, each join can return at most one row per employee and date range. To deal with gaps (for example in the Car table), I used the isnull function to replace the missing values with a meaningful value (for example ‘no company car’ or ‘unknown’).
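To make the three steps (collect all dates, build consecutive ranges, look up the matching version per table) easier to follow, here is a small Python sketch of the same idea for a single employee. It is an illustration under assumptions, not the T-SQL from this post: the function names, the car id “CAR-0001” and the open end date 9999-12-31 are invented for the example, and only two of the four tables are used.

```python
from datetime import date

def value_at(versions, day, default):
    """Return the value of the version that is valid on `day`."""
    for valid_from, valid_to, value in versions:
        if valid_from <= day < valid_to:  # ValidTo is exclusive
            return value
    return default  # gap: no version covers this day

def combine(tables, defaults):
    """Build one timeline for a single employee from several tables.

    tables: {table name: [(valid_from, valid_to, value), ...]}
    defaults: {table name: replacement value for gaps}
    """
    # Step 1: collect every boundary date once (the UNION of all dates).
    dates = sorted({d for versions in tables.values()
                    for valid_from, valid_to, _ in versions
                    for d in (valid_from, valid_to)})
    combined = []
    # Step 2: pair each date with its successor (the lead() lookup).
    for start, end in zip(dates, dates[1:]):
        row = {"ValidFrom": start, "ValidTo": end}
        # Step 3: pick the version of each table that covers the range.
        for name, versions in tables.items():
            row[name] = value_at(versions, start, defaults[name])
        combined.append(row)
    return combined

END = date(9999, 12, 31)  # assumed open end date
tables = {
    "OrgUnit": [(date(2000, 1, 1), date(2014, 1, 1), "Delivery"),
                (date(2014, 1, 1), END, "PreSales")],
    "Car": [(date(2008, 6, 1), END, "CAR-0001")],  # hypothetical car id
}
defaults = {"OrgUnit": "unknown", "Car": "no company car"}
dimension = combine(tables, defaults)
for row in dimension:
    print(row)
```

Because every boundary date of every table is part of the combined date list, a source version either covers a resulting range completely or not at all. That is why looking up the version at the start of the range is equivalent to the overlap condition used in the joins.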

One final remark: in most cases, the source tables contain many more fields that are not relevant for the data warehouse. However, the valid from/to information reflects changes within these fields too, so the above approach would result in more versions than necessary. As long as your dimension does not get too big, this is not really a problem. On the contrary, if you later decide to include more information from the source tables, you already have properly distinguished versions for this information, so you do not need to correct fact rows afterwards. This could even make it a good idea to include valid from/to dates from other associated tables, even if no other information from those tables is used in the data warehouse yet.

But if your dimension gets too big with this approach, you could always ‘clean’ unnecessary versions using a simple group-by select with min(ValidFrom) and max(ValidTo), grouping by all other columns.
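One caveat with a plain group-by: it merges all rows with equal attribute values, including rows that are separated by a different version in between (for example, an employee who moves to another organizational unit and later returns). The following Python sketch, with invented names and sample values, shows one possible way to merge only directly adjacent versions of a single employee. It assumes the rows are sorted by ValidFrom and do not overlap.

```python
from datetime import date

def merge_adjacent(rows):
    """Collapse consecutive versions that carry identical attributes.

    rows: list of (valid_from, valid_to, attributes) for ONE employee,
    sorted by valid_from, without overlaps.
    """
    merged = []
    for valid_from, valid_to, attrs in rows:
        if merged and merged[-1][2] == attrs and merged[-1][1] == valid_from:
            # Same attributes and no gap: extend the previous version.
            merged[-1] = (merged[-1][0], valid_to, attrs)
        else:
            merged.append((valid_from, valid_to, attrs))
    return merged

# Hypothetical versions: Delivery, Delivery, PreSales, Delivery again.
versions = [
    (date(2000, 1, 1), date(2005, 1, 1), ("Delivery",)),
    (date(2005, 1, 1), date(2008, 1, 1), ("Delivery",)),
    (date(2008, 1, 1), date(2010, 1, 1), ("PreSales",)),
    (date(2010, 1, 1), date(9999, 12, 31), ("Delivery",)),
]
cleaned = merge_adjacent(versions)
```

In this sample, the first two versions are merged, while the last “Delivery” version stays separate because “PreSales” lies in between.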

So, this showed how to combine multiple tables into a single dimension. As mentioned above, you still need to create surrogate keys, and if you cannot eliminate the need for past data changes, you will also need to handle those changes.

by Hilmar Buchta

If you decide to upgrade from SSIS 2008 to SSIS 2012, you might decide to use the Project Deployment Model and deploy your SSIS packages to the SSISDB instead of using the Package Deployment Model and deploying your packages to the file system. The Project Deployment Model brings a lot of advantages, but also some issues that you have to solve.

One of these issues is how to call packages that are part of a different SSIS project. For example you have several SSIS projects, e.g. one project for the packages that load the dimensions, one project for fact packages, one project for workflow packages.

In SSIS 2008 Package Deployment Model to File System you can use Execute Package Tasks to control the execution order of your packages. If you want to do the same in SSIS 2012 you get invalid tasks.

In SSIS 2012 Project Deployment Model it is not possible to start e.g. the dimension packages from the workflow package with the Execute Package Task, because they belong to different projects.

To solve this issue you can use an Execute SQL Task: when you execute packages on the server, the execution information is inserted into the SSIS internal tables and then stored procedures are executed that run the SSIS package. To get the relevant code, open SSMS, navigate to the folder in the SSIS catalog where your SSIS package is located, right-click on the package name and choose Execute.

In the pop-up screen, press the “Script” button to get the code.

This generates the following code:
Declare @execution_id bigint
EXEC [SSISDB].[catalog].[create_execution] @package_name=N'Load_DimProduct.dtsx', @execution_id=@execution_id OUTPUT, @folder_name=N'_TRDemo', @project_name=N'Dimension', @use32bitruntime=False, @reference_id=Null
Select @execution_id
DECLARE @var0 smallint = 1
EXEC [SSISDB].[catalog].[set_execution_parameter_value] @execution_id, @object_type=50, @parameter_name=N'LOGGING_LEVEL', @parameter_value=@var0
EXEC [SSISDB].[catalog].[start_execution] @execution_id

Copy this code to the Execute SQL Task that replaces the Execute Package Task.

The Execute SQL Task needs to connect to the SSISDB and execute the generated code.

If a Connection Manager to the SSISDB is not yet available then create it as a project connection.

Repeat this for all packages that you want to execute.

Execute the workflow package in the SSISDB and check the results in the execution reports of the workflow package and the packages that you call in the workflow package. You will see that the workflow package started the other packages and that the other packages were executed successfully.

In my next post I’ll explain how to make this solution more flexible. The current solution still has some disadvantages like redundant T-SQL code in every Execute SQL Task and it only works for exactly one environment reference.

by Thomas Rahier

With Data Factory, Microsoft has made a component available for Azure that lets you move ETL processes to the cloud. But is Azure Data Factory something similar to SSIS from SQL Server? Microsoft provides the following definition:

Azure Data Factory is a managed service that you can use to produce trusted information from raw data in cloud or on-premises data sources. It allows developers to build data-driven workflows (pipelines) that join, aggregate and transform data sourced from their local, cloud-based and internet services, and set up complex data processing logic with little programming.

Modeling options

First of all, the number of object types you can use in Azure Data Factory is limited. There are Linked Services, Tables and Pipelines. Linked Services describe a data source (e.g. Azure Storage, Azure SQL Databases, on-premises SQL Server databases, HDInsight). Here is a sample definition of a Linked Service for an on-premises SQL Server:

{
    "name": "MyOnPremisesSQLDB",
    "properties":
    {
        "type": "OnPremisesSqlLinkedService",
        "connectionString": "Data Source=<servername>;Initial Catalog=MarketingCampaigns;",
        "gatewayName": "<gateway name>",
        "gatewayLocation": "westus"
    }
}

Tables are analogous to tables in a relational database and describe the structure as well as how and where the data is stored. This definition references a table that is stored in a relational database.

{

    "name": "MyOnPremisesSQLServer",
    "properties":
    {
        "location":
        {
            "type": "OnPremisesSqlServerTableLocation",
            "tableName": "MyTable",
            "linkedServiceName": "MyLinkedService"
        },
        "availability":
        {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

A Pipeline corresponds most closely to a Data Flow in SSIS. A Pipeline consists of Activities. Activities in turn can transform data, similar to Data Flow Components in SSIS. Here is an example:

Scripting

The data format for Data Factory is JSON. JSON (JavaScript Object Notation) is a compact data format in an easily readable text form. Here is an example of a Pipeline:

{
    "name": "PipelineName",
    "properties":
    {
        "description" : "pipeline description",
        "activities":
        [
        ]
    }
}

To set string properties at runtime, additional functions are needed. These are called Data Factory Functions. The number of functions currently available is fairly small. Here is a short example in which two Data Factory Functions (Text.Format and Time.AddHours) are used to build a SQL query.

{
    "Type": "SqlSource",
    "sqlReaderQuery": "$$Text.Format('SELECT * FROM MyTable WHERE 
         StartTime = \\'{0:yyyyMMdd-HH}\\'', Time.AddHours(SliceStart, 0))"
}

Implementation status

As of today, Azure Data Factory is still in a trial phase. There are only two Activities with which data can be copied. There is currently no integrated development environment either, so Pipelines have to be written as JSON in a text editor. Nevertheless, complex transformations can already be built today with the help of Pig and Hive. It remains to be seen what the product will look like in its final release.

by Daniel Esser

In my last post I explained how to start an SSIS 2012 package that is located in a different SSIS project.

However, this solution is not very flexible, because the code is always only valid for one package and you have to repeat all the steps again and again. In case of errors this could mean a lot of work to correct all Execute SQL Tasks. And because of a fixed environment reference it is nearly impossible to deploy the packages from e.g. the development server to production server.

The solution is to create a user-defined stored procedure in the SSISDB that contains all the code and provides the flexibility that you need. Add the following parameters to the procedure:
– PackageName = The name of the package that you want to execute
– FolderName = The name of the folder where your project is stored
– ProjectName = The name of the SSIS project the package belongs to
– EnvironmentName = The name of the SSIS environment that you want to use
– ExecuteSync = A flag to control if the package is executed synchronously (=1, my recommendation) or not (=0)
– LoggingLevel = To set the SSIS logging level (default should be 1)

The procedure first has to select the proper environment reference from the SSISDB tables internal.folders, internal.projects and internal.environment_references using the parameters FolderName, ProjectName and EnvironmentName.

Next, use SSMS to create the T-SQL code to start an SSIS package on the server (as described in part 1), copy the code to the stored procedure and modify the code to use the parameters of the stored procedure.

Finally, you should check the execution status of the SSIS execution. If the SSIS package did not run successfully and you chose synchronous execution, you can raise an error to stop the workflow package.

In the Execute SQL Task call the stored procedure and assign appropriate values to the parameters to execute the packages.

Execute the workflow packages and check the execution results to see that the packages were executed.

by Thomas Rahier

Azure | SQL Server 2014

With SQL Server 2014 it’s easy to move database files to Azure Blob storage even if SQL Server runs on premises. Azure Blob storage offers reliable, cheap and highly available storage, which could be useful for “cold” data, for example.

However, configuration is a little bit tricky, so I’m going to walk through this process step by step.

1. Create an Azure Blob store account and a container

Log into Azure and create a new storage account. For my example, I’m using “db4” as the name as shown below:

Next, I’m going to create a blob store container, which I name “data” here:

In order to access the container, we need the URL to the container (db4.blob.core.windows.net/data in my example) and the storage key. The key can be obtained by clicking on “Manage Access Keys” at the bottom of the screen:

You can copy the key to the clipboard by clicking on the icon right beside the Primary Access Key box.

For the next task I’m using Windows Azure Storage Explorer (download here). Here you can add your storage account by pasting the access key into the storage account key input box:

2. Create a Shared Access Signature for the container

In Azure Storage explorer, select the container (data) and click on ‘Security’:

This brings up the following dialog. Make sure to select the permissions list, delete, read and write. After clicking on ‘Generate Signature’ a shared access signature is created. Copy this signature to the clipboard.

3. In SQL Server: Create a credential for the blob container

In SQL Server we’re using the create credential statement to create a credential for the blob store. Make sure to replace the secret key with the generated shared access signature from the last step (I just obfuscated the key by overwriting part of the key with ‘x’):

CREATE CREDENTIAL [https://db4.blob.core.windows.net/data]
WITH IDENTITY='SHARED ACCESS SIGNATURE',
SECRET = 'sv=2014-02-14&sr=c&sig=c%2Fxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx3%3A00%3A00Z&se=2014-12-31T23%3A00%3A00Z&sp=rwdl'

If you like, you can check the credentials with “select * from sys.credentials”:

4. In SQL Server: Create a database that uses the blob container

The next step is simple. We just need to create a database using the container as its storage:

CREATE DATABASE testdb
ON
( NAME = testdb_dat,
FILENAME = 'https://db4.blob.core.windows.net/data/TestData.mdf' )
LOG ON
( NAME = testdb_log,
FILENAME = 'https://db4.blob.core.windows.net/data/TestLog.ldf')

You can create tables and load data in just the same way as you would do with a local database file. Azure Storage Explorer lists the database files that are created:

5. Optional: Register the blob store in SQL Server Management Studio

You can register the blob store in SQL Server Management Studio by creating a connection to Azure:

The “Access Key” is the key we created in the first step and can simply be copied into the account key field:

After connecting to the Azure blob store, Management Studio shows our container together with the database files:

Of course, when placing database files on Azure, a connection is needed to the blob store. If you don’t have this connection, you will not be able to access the database:

Summary

With SQL Server 2014 it is easy to put data files on an Azure storage account, even for an on-premises SQL Server. Use cases include

store data that is not heavily queried
store data that you want to secure in a geo-redundant way
enhance the local storage of a SQL Server
perform a single table backup to the cloud
… and many more

by Hilmar Buchta

In this article I would like to show you how to generate test data with a script component in SSIS. You often build a data flow in SSIS and want to test it with sample data. Creating appropriate test data can sometimes be very time-consuming. Instead of connecting to a database through a data source connection, one good way to create test data is to use the script component as a source in a data flow.

In this example I show you how to construct mobile numbers when using a script component as a source in SSIS.

Drag a data flow task from the SSIS toolbox to the control flow designer pane. If you want, you can rename it.

Go to the Data Flow pane and drag the script component from the SSIS toolbox to the data flow designer pane. A pop up window appears where you can select the script component type. Here you choose Source.

Then you can rename your script component if you want.

Then open the script component. In the next step you have to define the output columns of the script component. On the left side of the window, click on Inputs and Outputs. Then click on Output Columns of Output 0. Now you can add columns to Output 0.

In this example the first column is called Id with the data type DT_I4. The second column is called MobileNumber. Here the DataType is changed to DT_WSTR.

Now choose Scripts on the left side and click on the Edit scripts button. A new window opens where you can write custom code. Add the following lines to the function CreateNewOutputRows:

The code uses a loop that generates 1000 rows for the output columns Id and MobileNumber. Please note that MobileNumber is a string. The string starts with “491234”, which is concatenated with the current value of i. The zeros after the colon in the format string represent leading zeros. For example, for i=17 you get the value 00017.
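The logic of that loop can be sketched in a few lines. The snippet below is written in Python rather than in the script component’s own language and is only meant to show the number formatting. The function name create_rows and the assumption that the loop counts from 1 to 1000 are mine.

```python
def create_rows(count=1000, prefix="491234"):
    """Yield (Id, MobileNumber) pairs like the script component source."""
    # Assumption: the loop counter starts at 1.
    for i in range(1, count + 1):
        # {i:05d} pads i with leading zeros to five digits, e.g. 17 -> 00017.
        yield i, f"{prefix}{i:05d}"

rows = list(create_rows())
print(rows[16])  # the row for i = 17
```

Each generated pair corresponds to one row that the script component adds to Output 0.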

Now, in order to see the data, drag a derived column component from the SSIS toolbox to the data flow designer pane and link it with the script component. Then click on the link and enable a data viewer. Your data flow designer pane should look like this:

Now you can start the SSIS package and the data generated is shown below:

Now you can develop your data flow task using the generated test data.

by André Kienitz

I’ll show how permissions in SSIS SQL Server 2012/2014 are set, managed and queried.

With SQL Server 2012 Microsoft introduced a new Project Deployment Model with a dedicated SQL Server Database.
DTSX Packages are no longer deployed into the MSDB but in the SSISDB.

It’s a database, so you can manage security as in any other database but you also have the chance to set the security for every folder, project or environment.

Every permission you set on an item like a folder, project or environment is stored in the SSISDB in the corresponding table. In the screenshot you can see the example for folders and the corresponding “folder_permissions” table.

folder permission

For this example I created two users (“Environment_Changer” and “SSIS_Executor001”). We will give them permissions corresponding to their names. I created two different folders for two different Integration Services projects, and in each folder there are two different environments.

catalog folder structure

For the first project we will set the permissions on the folder level; for the second project we will explicitly set the permissions on the environment to show the different levels of permissions.

The next screenshots show the effective permissions set via Management Studio:

permissions 01 folder 01

As you can see, the user “Environment_Changer” has been granted the permissions to read and change the objects in the project folder.

permissions 02 folder 01

The SSIS_Executor001 is allowed to read and execute the objects in the folder but there is no grant to modify.

permissions 01 folder 02

In the second folder I gave permissions on the Environment-Level only. User “Environment_Changer” is allowed to read and modify objects.

permissions 02 folder 02

And here the SSIS_Executor001 has an explicit deny on the modify permission.

This we can see and manage in the GUI. For every object there is a corresponding table for permissions.

For our first example with the folder permission, we can use the following two queries to get a list of folders and the permissions set on the folder level:

SELECT TOP 1000
[folder_id],
[name],
[description],
[created_by_sid]
FROM
[SSISDB].[internal].[folders]

SELECT TOP 1000
[id],
[sid],
[object_id],
[permission_type],
[is_deny],
[grantor_sid]
FROM
[SSISDB].[internal].[folder_permissions]

query results 01

We have two folders and we can query the permissions. As all the information is represented by numbers, I wrote a query to make it more readable. This query returns the folder and the environment permissions. I intentionally used “union all” so that you can use only parts of the query and don’t have to comprehend a recursive query.

USE [SSISDB]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

create table #Permisiontypes(
id smallint,
description nvarchar(50)
)

insert into #Permisiontypes values
(1,'READ'),
(2,'MODIFY'),
(3,'EXECUTE'),
(4,'MANAGE_PERMISSIONS'),
(100,'CREATE_OBJECTS'),
(101,'READ_OBJECTS'),
(102,'MODIFY_OBJECTS'),
(103,'EXECUTE_OBJECTS'),
(104,'MANAGE_OBJECT_PERMISSIONS')

/*
Source of Info:
http://msdn.microsoft.com/en-us/library/ff878149.aspx */

SELECT [Object] = 'Folder',
Foldername = fold.name,
Environmentname = NULL,
[permission_description] = #Permisiontypes.description,
Principals.Name,
[is_deny]
FROM [internal].[object_permissions] ObjPerm

join sys.server_principals Principals

on ObjPerm.sid = Principals.sid

join #Permisiontypes

on ObjPerm.permission_type = #Permisiontypes.id

join internal.folders fold

on fold.folder_id = ObjPerm.object_id

where object_type = 1

union all

SELECT [Object] = 'Environment',
Foldername = fold.name,
Environmentname = env.name,
[permission_description] = #Permisiontypes.description,
Principals.Name,
[is_deny]

FROM [internal].[object_permissions] ObjPerm

join sys.server_principals Principals

on ObjPerm.sid = Principals.sid

join #Permisiontypes

on ObjPerm.permission_type = #Permisiontypes.id

join [catalog].[environments] env

on ObjPerm.object_id = env.environment_id

join catalog.folders fold

on env.folder_id = fold.folder_id

where object_type = 3

order by Object desc,Foldername,
Principals.name,
permission_description

drop table #Permisiontypes

This gives the following result:

query results 02

We can find all the permissions including the explicit deny we already saw in the management studio GUI.

This result can now be used in different ways. One idea could be to store it somewhere and to compare it with the current state after a few months. That gives us a lightweight solution to audit the object permissions in the SSISDB.

by Stefan Grigat

In a Data Vault Model, Business Keys are used to integrate data. A Hub Table contains a distinct list of Business Keys for one entity. This makes the Hub Table the “Key Player” of the Data Vault Data Warehouse. This blog post explains the pattern for loading a Hub Table. I will also explain the differences between Data Vault 1.0 Hub Loads and Data Vault 2.0 Hub Loads.

Data Vault 1.0 Hub Load

The function of a Hub Load is to load a distinct list of new Business Keys into the Hub Table. In Data Vault 1.0 it also generates a Sequence as Surrogate Key. Previously loaded Business Keys will be dropped from the Data Flow.

Example

To explain Hub Loads I am using a well-known example from the AdventureWorks 2012 database. The example load is supposed to load product information into the Data Vault Model.

The destination table Product_Hub has the following structure.

Column Product_Seq
Surrogate Key for the Product

Column Product_Bk
Business Key of the Product – In this case we use the Product Number. Hub Tables have at least one column for the Business Key. When composite Business Keys are needed there can be more.

Column LoadTimestamp
Time of the Load – This column indicates the data currency.

Column LoadProcess
ID of the ETL Load that has loaded the record

Column RecordSource
Source of the Record – At a minimum it needs to name the source system, but you can also add more specific values like the source table. The column could also contain the name of the business unit where the data originated. Choose what suits your requirements.

The Data Source is the table Product.Product. This data will be loaded without any transformation into the Stage Table Product_Product_AdventureWorks2012. The Stage Load is performed by a SQL Server Integration Services Package.

Product (Source Table) >> Product_Product_AdventureWorks2012 (Stage Table)

From the Stage, the Hub Load Pattern is used to load the data into the destination table.

Product_Product_AdventureWorks2012 (Stage Table) >> Product_Hub (Destination Table)

The following T-SQL and SSIS implementations illustrate the concept of Hub Loads.

T-SQL – Implementation

In this example a stored procedure Load_Product_Hub represents the Hub Load Pattern. To generate the Surrogate Sequence Number, the procedure is using a SQL Server Sequence.

Creating a Sequence in T-SQL

The stored procedure Load_Product_Hub shows how the load pattern could be implemented in T-SQL.

The command NEXT VALUE FOR gets the new Sequence Number for each row.
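As a language-neutral illustration of the load pattern, here is a minimal Python sketch. It is not the stored procedure Load_Product_Hub: the function name load_hub, the in-memory list that stands in for the Hub Table, the counter that stands in for the SQL Server Sequence and the sample product numbers are all assumptions made for this example.

```python
from datetime import datetime
from itertools import count

def load_hub(hub, stage_keys, sequence, load_process, record_source):
    """Append business keys from the stage that are not yet in the hub."""
    known = {row["Product_Bk"] for row in hub}
    load_ts = datetime.now()
    # dict.fromkeys removes duplicates while keeping the stage order.
    for bk in dict.fromkeys(stage_keys):
        if bk in known:
            continue  # previously loaded keys are dropped from the flow
        hub.append({
            "Product_Seq": next(sequence),  # stands in for NEXT VALUE FOR
            "Product_Bk": bk,
            "LoadTimestamp": load_ts,
            "LoadProcess": load_process,
            "RecordSource": record_source,
        })
    return hub

product_seq = count(1)   # stand-in for the database sequence
product_hub = []         # stand-in for the Product_Hub table
load_hub(product_hub, ["AR-5381", "BA-8327", "AR-5381"],
         product_seq, 1, "AdventureWorks2012")
load_hub(product_hub, ["BA-8327", "BE-2349"],
         product_seq, 2, "AdventureWorks2012")
```

After both loads the hub holds three rows: the duplicate in the first stage batch and the already known key in the second batch are skipped, and the sequence keeps counting across loads.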

SQL Server Integration Services – Implementation

When we model Data Warehouse Loads, we want to use an ETL Tool like Microsoft SQL Server Integration Services.

This sample shows how the pattern can be implemented in Integration Services.

Generation of a Surrogate Sequence Key when using Integration Services

There are many ways to generate keys within SSIS. You can write your own key generator in C# or use a third-party generator. The database itself also has some capabilities to generate keys. It is common to use an Identity Column to generate keys. But if you truncate the table, the Identity Column will be reset. Depending on your ETL architecture, this can cause duplicate key issues.

Recommendation

To avoid this behavior I recommend using a Sequence in a Default Constraint on the destination table. The benefit of a Sequence is that you do not have to deal with key generation within your ETL. Leave this task to the database. Keys are independent of the loading process. This makes it possible to exchange the ETL process or the ETL tool if necessary. The Sequence also keeps incrementing when the table gets truncated.

Using the Sequence as a Default Constraint on the destination table

Every new row will get a new Sequence Number by default. But still you can set the value within your ETL Process, if that is needed.
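The difference to an Identity Column can be made concrete with a toy model. This is a Python sketch under stated assumptions, not SQL Server code: the class HubTable and its methods are hypothetical, and the counter plays the role of a Sequence that lives outside the table.

```python
from itertools import count


class HubTable:
    """Toy table whose key column defaults to the next value of a shared sequence."""

    def __init__(self, sequence):
        self.sequence = sequence  # lives outside the table, like a database Sequence
        self.rows = []

    def insert(self, business_key, key=None):
        # Default behaviour: take the next sequence value.
        # An explicit key passed by the ETL process overrides the default.
        if key is None:
            key = next(self.sequence)
        self.rows.append((key, business_key))
        return key

    def truncate(self):
        # Removes the rows but does not touch the sequence.
        self.rows.clear()


hub = HubTable(count(start=1))
keys_before = [hub.insert("A"), hub.insert("B")]
hub.truncate()
keys_after = [hub.insert("A"), hub.insert("B")]
explicit = hub.insert("C", key=1000)
```

Because the counter is not reset by truncate(), keys handed out after a reload never collide with keys handed out before it.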

Data Vault 2.0 Hub Load

The main difference between a Hub Load in Data Vault 1.0 and Data Vault 2.0 is the use of Hash Keys.

Hash Keys are a controversial topic. It is true that collisions can occur when using Hash Keys. On the upside, Hash Keys can solve issues with late-arriving data. More importantly, they can increase the load performance of the data model significantly, because they make it possible to load the data into the Data Vault Model fully in parallel. This works because the Hash Keys are derived from the Business Keys.

But the pros and cons of Hash Keys are not the subject of this article. In later posts we will investigate how Data Warehouse Loads can benefit from Hash Keys.

The load pattern for a Data Vault 2.0 table is basically the same as in Data Vault 1.0. Only the Surrogate Sequence Key Generator is replaced by a Hash Generator.

In Data Vault 2.0, Hash Keys replace sequence keys. Therefore we have to modify our data model a little. The Product_Seq column has to be replaced by a column Product_Hsk. Using different suffixes here helps to differentiate Data Vault 1.0 and Data Vault 2.0 tables. The data type of the column has to change as well. It is recommended to use a Char(32) field to store an MD5 Hash Key.

T-SQL – Implementation

Generating an MD5 Hash Key in T-SQL

To implement the modified Hub Load Pattern, a Hash Key Generator is needed. In T-SQL I have implemented a custom function that returns the Hash Key of a given Business Key. The function uses the SQL Server HASHBYTES function to generate the Hash Key.

In the Procedure Load_Product_Hub the “SQL Server Sequence – Call” has been substituted by a call of the new Hash Generator function “GetHashKey”.
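What such a hash generator does can be sketched as follows. This is a Python illustration, not the author's GetHashKey function: the name get_hash_key, the trimming and upper-casing of the Business Key, the UTF-8 encoding and the upper-case output are all assumptions of this sketch. Note that the same string hashed as UTF-8 and as UTF-16 yields different values, so every load process has to agree on the encoding.

```python
import hashlib


def get_hash_key(business_key: str) -> str:
    """Return a 32-character hexadecimal MD5 hash for a business key.

    Trimming and upper-casing before hashing is an assumption of this sketch;
    whatever normalisation is chosen has to be identical in every load process.
    """
    normalised = business_key.strip().upper()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest().upper()


product_hsk = get_hash_key(" ar-5381 ")
```

An MD5 digest is 128 bits long, which is why its hexadecimal representation fits exactly into a Char(32) column.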

SQL Server Integration Services – Implementation

In SSIS the Data Flow Task has to be extended by a Script Component that generates the Hash Key.

Generating an MD5 Hash Key in Integration Services

Within the Data Flow Script Component “Generate Hash Key” I added the following C# script.

The script uses the System.Security.Cryptography.MD5CryptoServiceProvider to build the MD5 Hash.

Conclusion

The implementations shown are examples to explain how a Hub Load works. Each individual project will require its own implementation of these patterns.

A Hub Load is a simple pattern that can easily be repeated for every entity in the data model. This is one of the reasons why the Data Vault Model is so scalable.

Because Data Vault Loads are standardised, they can be generated and developed with a high degree of automation. As a result, Enterprise Data Warehouse projects can be developed faster and in a more agile way.

by Daniel Piatkowski

When I recently ran a workshop on data modeling and data loading with SSIS at a customer’s site, we developed a small star schema and loaded the dimension and fact tables. I let the customer do everything themselves and limited myself to the role of trainer. That worked very well right away, and the star schema was quickly filled with data that could already be analyzed productively. The customer got a taste for it and wanted to enrich the model with additional information, which took the form of new attributes in the dimension tables. Extending the data model accordingly and applying the changes to the database was done quickly, adapting the SSIS packages that load the target tables was no problem either, and the data could be reloaded. It turned out, however, that more facts were now being loaded than before, although nothing had changed in the underlying source data.

The cause was quickly found: we had created a Cartesian product by joining in a new table!

This new table contained the additional attributes. However, the required information could be found in either of two columns of the new table. That would have been unproblematic if the columns had not also been located in different rows, because the following rule applied:

If column 1 (MM1) is filled in row 1, the value from column 1 (MM1) is to be used; otherwise the value from column 2 (MM2) of row 2 is to be used. Only one of MM1 and MM2 is ever filled, and not every value of the join column necessarily has two rows.

The possible two rows were the reason we had ended up with a Cartesian product, because the rule above only became clear to us once we had taken a closer look at the new table. So we faced the challenge of determining the correct value from MM1/row 1 or MM2/row 2 without producing a Cartesian product.

To illustrate the situation with a simple example, we create a small sample table:

create table SIMULACRUM.dbo.tab_x (abc_id integer identity(1,1) not null,
 xyz_id integer not null,
 MM1 nvarchar(20) null,
 MM2 nvarchar(20) null,
 blah nvarchar(20) null,
 blubb nvarchar(20) null);

We now insert a few rows into this table that reproduce the problem:

insert into <DB-Name>.dbo.tab_x (xyz_id, MM1, MM2, blah, blubb) values (1, NULL, NULL,'blah','blubb');

insert into <DB-Name>.dbo.tab_x (xyz_id, MM1, MM2, blah, blubb) values (1, NULL, 'Y','blah','blubb');

insert into <DB-Name>.dbo.tab_x (xyz_id, MM1, MM2, blah, blubb) values (2, 'X', NULL,'blah','blubb');

insert into <DB-Name>.dbo.tab_x (xyz_id, MM1, MM2, blah, blubb) values (2, NULL, 'M','blah','blubb');

insert into <DB-Name>.dbo.tab_x (xyz_id, MM1, MM2, blah, blubb) values (3, 'Z', NULL,'blah','blubb');

insert into <DB-Name>.dbo.tab_x (xyz_id, MM1, MM2, blah, blubb) values (3, NULL,'A','blah','blubb');

Our sample table now looks roughly like this:

The column xyz_id is the column on which the join is to be performed. For each value of xyz_id (the join column), only one row should be returned.

According to the rule above, we would therefore expect the following result (Wert means value):

xyz_id	Wert
1	Y
2	X
3	Z

We get this result by checking whether MM1 or MM2 is NULL, selecting whichever value is filled (not null), and grouping by xyz_id. We have to wrap the selected value in an aggregate function (e.g. MIN() or MAX()) so that we really get only one row for each value of xyz_id:

select      xyz_id,
            min(case when MM1 is null then MM2 else MM1 end) as xyz_min,
            max(case when MM1 is null then MM2 else MM1 end) as xyz_max
from        <DB-Name>.[dbo].[tab_x]
group by    xyz_id

In this simple example, applying MAX() leads, by coincidence, to the correct result (column xyz_max):

But as we all know, coincidence is nothing to rely on. So we need a method that always works. For this we use a small trick and prepend 1 to the value if it comes from MM1 and 2 if it comes from MM2. Combined with MIN(), this always gives us the correct value:

with q as (
select      xyz_id,
            min(case when MM1 is null then '2'+MM2 else '1'+MM1 end) xyz
from        <DB-Name>.[dbo].[tab_x]
group by    xyz_id
)
select      q.xyz_id,
            substring(q.xyz,2,20) as Wert   -- strip the prefix; MM1/MM2 are nvarchar(20)
from        q

The 1 or 2 is then simply removed again afterwards, and the result of the SQL statement above can be used in a join without the risk of creating a Cartesian product.
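The prefix trick can also be traced outside the database. The following Python sketch uses the sample rows from above; the function name pick_value is an assumption made for illustration, and None plays the role of NULL.

```python
from collections import defaultdict

rows = [  # (xyz_id, MM1, MM2), the sample rows from above
    (1, None, None), (1, None, "Y"),
    (2, "X", None), (2, None, "M"),
    (3, "Z", None), (3, None, "A"),
]


def pick_value(rows):
    """Per xyz_id, prefer a filled MM1 over a filled MM2."""
    tagged = defaultdict(list)
    for xyz_id, mm1, mm2 in rows:
        if mm1 is not None:
            tagged[xyz_id].append("1" + mm1)  # values from MM1 sort first
        elif mm2 is not None:
            tagged[xyz_id].append("2" + mm2)
        # rows with both columns empty contribute nothing, like NULL in MIN()
    # min() picks the smallest tagged value; [1:] removes the prefix again
    return {xyz_id: min(values)[1:] for xyz_id, values in tagged.items()}


result = pick_value(rows)
```

One difference to the SQL statement: an xyz_id whose rows are all empty does not appear in this dictionary at all, whereas the grouped query would return it with a NULL value.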

by Jörg Menker

BI Innovation Lab: the customer joins the research

Business intelligence technologies are complex. And they are evolving rapidly. As a decision-maker, it is easy to lose track. Which topics are relevant for my industry? And how can current developments be put to the best use for my company? The Innovation Lab of Oraylis provides an answer.

Mobile BI, operational BI, self-service BI, cloud BI, social media analytics: the list of current trends in business intelligence could be extended almost at will. And the impression is not deceptive: the pace of innovation keeps increasing. Those who do not keep up quickly fall behind. That applies not only to companies. External service providers in particular have to find ways and means to live up to their consulting claim in the long term.

The example of Oraylis shows how such processes can be established successfully: On the one hand, a competence center consisting of Oraylis experts and representatives of the company is set up at the customer’s site. Here, the strategic direction and business questions are discussed continuously and their technical feasibility is weighed up. The result is concrete recommendations for action. On the other hand, there is the so-called Innovation Lab: Oraylis’s internal research facility continuously puts the latest achievements in the BI field to the test. This ensures that customer solutions always fully exploit the potential of new technologies. If required, customers can also take part directly in the test runs in order to clarify the possible added value for their own situation.

The Innovation Lab workshops take place in an informal atmosphere. Customers are welcome at any time with their industry-specific questions.

A playground for new insights

The Innovation Lab’s research activities take place in joint expert workshops. Each time, a specific methodology or technology is the focus. In this context, participants are given a virtual environment as a “playground” that already contains a certain amount of starter know-how. This ensures that the threshold to new discoveries and insights is as low as possible. Individual teams then work on different aspects and questions relating to the respective topic. Finally, the knowledge gained is brought together and translated into best-practice methods and suitable tools.

The impulses for new topics come from very different directions. The first source is, of course, the current market and industry discourse, which at present, for example, is increasingly focusing on Hadoop as the basis of future BI back ends. Likewise, for Oraylis as a Microsoft Gold Partner, the vendor’s forward-looking developments repeatedly give rise to further test runs. The most recent example is the “cloud first, mobile first” strategy, which has triggered a veritable wave of innovation. Last but not least, the impetus can also come directly from research and teaching. Among other things, the Innovation Lab operators maintain a close cooperation with the Chair of Information Systems at the University of Cologne.

From Facebook to predictions

The insights for BI practice that emerge from the Innovation Lab are as diverse as the sources. Together with Insius, a commercial spin-off of the Cologne chair, an efficient self-service BI approach for analyzing Facebook data was established, for example. As it turned out, Excel for Power BI offers an adapter that makes it quick and easy to load Facebook content into an in-memory cube. Through enrichment via Power Query and automated content scoring, the data can finally be analyzed and visualized in a meaningful way.

Microsoft’s cloud technology has meanwhile also opened up a broad field of research with a high potential for insights. One focus is on the trending topic of “predictive analytics”, that is, forecasts based on existing data. Customers are also keen to take part in the corresponding workshops, because the application scenarios are diverse and relevant to a wide range of industries: they range from “churn” analyses, i.e. targeted predictions of individual customers’ willingness to switch, through the prevention of fraud attempts in the form of “fraud prevention”, to the “next best offer”, which provides suggestions for suitable customer offers.

The participation of a leading manufacturer of beverage filling systems resulted in a forward-looking solution for the predictive maintenance and servicing of machines. The machine builder’s increasingly automated production lines continuously deliver vast amounts of sensor data, which had largely remained unused until then. By combining Hadoop, Power BI and the cloud machine learning service, the Innovation Lab created a way to store the flood of data cheaply and provide it in aggregated form for in-memory analyses. In this way, well-founded statements can now be made about ideal maintenance intervals, which ultimately provide lasting support for the manufacturer’s prototyping. It proves particularly helpful that Microsoft’s cloud solution supports the hybrid data access preferred in Germany.

Current topic of the future: “Big Data”

One research area of the upcoming workshops will be technologies around the future topic of “Big Data”. With the Analytics Platform System (APS), Microsoft now offers a preconfigured and particularly powerful solution that integrates a Hadoop platform into the classic data warehouse BI stack. This means that even extremely large volumes of relational data can be processed with short response times. As far as the approach to data storage is concerned, however, the BI market is still finding its way. That is: which requirements can and should be mapped to Hadoop? And which remain in the relational data warehouse? With PolyBase, the APS has a technology through which the unstructured mass data of the Hadoop cluster can be linked seamlessly, transparently and, above all, flexibly. As a result, business users do not need any knowledge of where exactly the data is stored. For example, they can easily combine sales data for individual products from the relational database with corresponding sentiment from social media channels stored on Hadoop.

In this respect, APS and PolyBase open up a field of BI scenarios and variants that is as broad as it is complex and that needs to be explored within the Innovation Lab. Customers are again welcome at the workshops in this case, because ultimately it is the industry-specific questions and the added value for the respective core business that bring out the full potential of new technology.

by Jens Kröhnert

More often than one would like, one faces the duplicate problem, where the same record appears several times in a table (usually by mistake). If you do not want to, or cannot, delete and reload the entire table, the duplicates have to be eliminated in some other way. In a table with a primary key (PK), the records that exist twice or more differ in their PK and in nothing else. That is a good starting point for getting rid of the duplicates, following the rule: self-join on all columns except the PK, combined with differing PKs. But there are other options as well, as we will see.

To illustrate this, we create a small example:

create table SIMULACRUM.dbo.dubtab1
(
      tab_id int not null,
      tab_bk nvarchar(20) not null,
      tab_bez nvarchar(50),
      tab_n1 nvarchar(50),
      tab_n2 nvarchar(50)
);

ALTER TABLE SIMULACRUM.dbo.dubtab1 ADD CONSTRAINT
      PK_dubtab1 PRIMARY KEY CLUSTERED
(
      tab_id
);

We now insert a few records with duplicates into the sample table:

insert into SIMULACRUM.dbo.dubtab1 values (1,'001','Bezeichnung 1','blah1','blubb1');

insert into SIMULACRUM.dbo.dubtab1 values (2,'002','Bezeichnung 2','blah2','blubb2');

insert into SIMULACRUM.dbo.dubtab1 values (3,'003','Bezeichnung 3','blah3','blubb3');

insert into SIMULACRUM.dbo.dubtab1 values (4,'004','Bezeichnung 4','blah4','blubb4');

insert into SIMULACRUM.dbo.dubtab1 values (5,'005','Bezeichnung 5','blah5','blubb5');

insert into SIMULACRUM.dbo.dubtab1 values (6,'001','Bezeichnung 1','blah1','blubb1');

insert into SIMULACRUM.dbo.dubtab1 values (7,'002','Bezeichnung 2','blah2','blubb2');

insert into SIMULACRUM.dbo.dubtab1 values (8,'003','Bezeichnung 3','blah3','blubb3');

insert into SIMULACRUM.dbo.dubtab1 values (9,'004','Bezeichnung 4','blah4','blubb4');

insert into SIMULACRUM.dbo.dubtab1 values (10,'005','Bezeichnung 5','blah5','blubb5');

insert into SIMULACRUM.dbo.dubtab1 values (11,'005','Bezeichnung 5','blah5','blubb5');

A SELECT * FROM [SIMULACRUM].[dbo].[dubtab1] order by tab_bk returns the following result, as expected:

As you can see, every record exists twice, except for tab_bk = '005', which even exists three times, and the copies differ only in their PK (tab_id).

One way to delete the duplicates is to use the MERGE statement:

merge into SIMULACRUM.dbo.dubtab1 as y
using (select distinct   -- a row with several lower-id twins must appear only once in the source
            a.tab_id,
            a.tab_bk,
            a.tab_bez,
            a.tab_n1,
            a.tab_n2
from  SIMULACRUM.dbo.dubtab1 as a
inner join SIMULACRUM.dbo.dubtab1 as b on (a.tab_bk = b.tab_bk
 and a.tab_bez = b.tab_bez and a.tab_n1 = b.tab_n1
 and a.tab_n2 = b.tab_n2 and a.tab_id > b.tab_id)
) as x
on y.tab_bk = x.tab_bk and y.tab_bez = x.tab_bez
 and y.tab_n1 = x.tab_n1 and y.tab_n2 = x.tab_n2
 and y.tab_id = x.tab_id
when matched then delete;

In the inner SELECT statement we use a self-join over all columns and check for equality. Of the records found this way, only the one with the larger PK value (tab_id) is used.

The result is a table without duplicates:

Things look a little different if the table containing the duplicates has no primary key. Here, too, a quick small example:
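Which rows the self-join selects can be checked with a small sketch. This is Python rather than T-SQL and uses only a subset of the sample rows; the variable names are assumptions made for illustration.

```python
rows = {  # tab_id -> (tab_bk, tab_bez, tab_n1, tab_n2)
    1: ("001", "Bezeichnung 1", "blah1", "blubb1"),
    5: ("005", "Bezeichnung 5", "blah5", "blubb5"),
    6: ("001", "Bezeichnung 1", "blah1", "blubb1"),
    10: ("005", "Bezeichnung 5", "blah5", "blubb5"),
    11: ("005", "Bezeichnung 5", "blah5", "blubb5"),
}

# Self-join: every pair (a, b) with identical content and a.tab_id > b.tab_id.
pairs = [(a, b) for a in rows for b in rows if rows[a] == rows[b] and a > b]

# tab_id 11 is paired with both 5 and 10, so it occurs twice on the left side.
ids_to_delete = {a for a, _ in pairs}

survivors = {tab_id: r for tab_id, r in rows.items() if tab_id not in ids_to_delete}
```

For every group of identical records, the row with the smallest tab_id never appears on the left side of a pair and therefore survives. A record that exists three times shows up more than once on the left side, which is worth keeping in mind when the join result feeds a statement that expects each target row only once.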

create table SIMULACRUM.dbo.dubtab2
(
      tab_bk nvarchar(20) not null,
      tab_bez nvarchar(50),
      tab_n1 nvarchar(50),
      tab_n2 nvarchar(50)
);

Here, too, we insert the same records (apart from the missing PK):

insert into SIMULACRUM.dbo.dubtab2 values ('001','Bezeichnung 1','blah1','blubb1');

insert into SIMULACRUM.dbo.dubtab2 values ('002','Bezeichnung 2','blah2','blubb2');

insert into SIMULACRUM.dbo.dubtab2 values ('003','Bezeichnung 3','blah3','blubb3');

insert into SIMULACRUM.dbo.dubtab2 values ('004','Bezeichnung 4','blah4','blubb4');

insert into SIMULACRUM.dbo.dubtab2 values ('005','Bezeichnung 5','blah5','blubb5');

insert into SIMULACRUM.dbo.dubtab2 values ('001','Bezeichnung 1','blah1','blubb1');

insert into SIMULACRUM.dbo.dubtab2 values ('002','Bezeichnung 2','blah2','blubb2');

insert into SIMULACRUM.dbo.dubtab2 values ('003','Bezeichnung 3','blah3','blubb3');

insert into SIMULACRUM.dbo.dubtab2 values ('004','Bezeichnung 4','blah4','blubb4');

insert into SIMULACRUM.dbo.dubtab2 values ('005','Bezeichnung 5','blah5','blubb5');

insert into SIMULACRUM.dbo.dubtab2 values ('005','Bezeichnung 5','blah5','blubb5');

A select * from SIMULACRUM.dbo.dubtab2 order by tab_bk returns the following result:

Here, too, all records exist at least twice, and the records with tab_bk = '005' even three times.

For eliminating the duplicates we now use a Common Table Expression (CTE):

with x as
(select     row_number() over (partition by tab_bk order by tab_bk) as nr
from  SIMULACRUM.dbo.dubtab2)
delete from x where x.nr > 1

Again we get the desired duplicate-free result:

This kind of duplicate elimination is more elegant than the first approach using a MERGE statement and can be used just as well for the first example. Strictly speaking, however, the partition by clause would have to be extended by the remaining non-key columns in order to be sure of eliminating only records that are duplicated in all columns. As the CTE is written now, it already eliminates records when only the value in column tab_bk occurs more than once.

The post by my colleague Hilmar Buchta deals with the same topic applied to very large tables.
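The effect of the partition columns can be shown with a short sketch. It is written in Python, the helper dedupe is hypothetical, and the third sample row (same tab_bk, different description) is invented here only to make the difference visible.

```python
from itertools import groupby


def dedupe(rows, key):
    """Keep one row per partition, i.e. drop every row that would get nr > 1."""
    kept = []
    for _, partition in groupby(sorted(rows, key=key), key=key):
        kept.append(next(partition))  # the row that would get nr = 1
    return kept


rows = [
    ("005", "Bezeichnung 5", "blah5", "blubb5"),
    ("005", "Bezeichnung 5", "blah5", "blubb5"),
    ("005", "Bezeichnung 5 (neu)", "blah5", "blubb5"),  # same tab_bk, different content
]

by_key_only = dedupe(rows, key=lambda r: r[0])  # partition by tab_bk
by_all_columns = dedupe(rows, key=lambda r: r)  # partition by every column
```

Partitioning by tab_bk alone also removes the row with the different description, while partitioning by all columns removes only the true duplicate.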

by Jörg Menker

On normal SMP SQL Servers there are several ways to perform string aggregations, via XML or via Pivot and dynamic SQL, some more efficient than others.

On APS/PDW you have to do something different.
We want a fast solution, so we like “JOIN” and simple additions, and we avoid anything like “CASE” or “ISNULL”.

Let’s start with a simple table:

We have several items with their row numbers in one table. In order to get them side by side, we create a matrix table which has an ID, an empty string on its diagonal and a NULL string in all other fields. Next we perform a join and get a table with the items on its diagonal, because adding a NULL value results in NULL, while adding an empty string and an item results in the item.

One max aggregation and one concat operation later we have the result: one row with all items concatenated.


Now we modify this approach to aggregate items into multiple lines.

From row 2 on, the matrix gets a space character (or any other delimiter) instead of the empty string. Next we add a column with a partitioned row number and join the matrix on it. Because we keep the line information, we can group by the line number. Then comes the concat, and we have our little Christmas song.
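The mechanics of the matrix join can be followed in a small model. This is a Python sketch, not PDW code: None stands for NULL, the helper names add, sql_max, build_matrix and aggregate are assumptions, and the delimiter is prepended from column 2 on rather than stored in the matrix.

```python
def add(a, b):
    """String concatenation with SQL-style NULL propagation (None + x -> None)."""
    return None if a is None or b is None else a + b


def sql_max(values):
    """MAX() ignores NULLs; returns None if nothing is left."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def build_matrix(size):
    """Row i holds an empty string in column i and None everywhere else."""
    return {i: ["" if col == i else None for col in range(1, size + 1)]
            for i in range(1, size + 1)}


def aggregate(items, size=5, delimiter=","):
    matrix = build_matrix(size)
    # Inner join: each item meets the matrix row with the same row number.
    joined = [
        [add(cell, item) for cell in matrix[rn]]
        for rn, item in enumerate(items, start=1)
        if rn in matrix
    ]
    # One MAX per matrix column picks the single non-NULL entry of that column.
    columns = [sql_max(row[c] for row in joined) for c in range(size)]
    # Concatenate, treating NULL as an empty string.
    parts = [columns[0] or ""]
    parts += [add(delimiter, value) or "" for value in columns[1:]]
    return "".join(parts)


line = aggregate(["10", "20", "30"])
```

Items beyond the matrix size find no join partner and are dropped, which corresponds to the upper limit of items per group that the matrix imposes.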


Thanks to Miriam Funke who implemented and tested it on the PDW.

We used this concept on a 4-node PDW to aggregate receipt items for side-by-side analysis. The matrix allows up to 200 items per receipt, and we reduced about 60 million lines to 5 million lines in about 2 minutes, even though we needed to perform a dense_rank and a row_number at once to eliminate identical items of a receipt in one step.

Merry Christmas!

Here is a query to test the concept on a standard SMP SQL Server, including the creation of sample data (if the last line is commented out with “--”, remove the “--” to execute the query):

DECLARE @i INT
DECLARE @i_max INT = 100 -- max aggregate items

DECLARE @query NVARCHAR(max)

/***** 1 create matrix for join *****/

SET @query =
'
CREATE TABLE #T1
(
rowid INT
'
SET @i = 1
WHILE @i <= @i_max
BEGIN
SET @query +=
'
,T'+CAST(@i AS NVARCHAR(10)) + ' NVARCHAR(1) '

SET @i += 1
END

SET @query +=
'
)

'

-- one row per matrix line: empty string on the diagonal, NULL everywhere else
SET @i = 1
WHILE @i <= @i_max
BEGIN
SET @query +=
'
INSERT INTO #T1 (rowid, T'+CAST(@i AS NVARCHAR(10))+') SELECT '+CAST(@i AS NVARCHAR(10)) + ',''''
'
SET @i+= 1
END
/***** CREATE SAMPLE DATA *****/

SET @query +=
'
CREATE TABLE #T2
(
ID INT IDENTITY(1,1),
VBELN INT,
POSNR NVARCHAR(10)
)

DECLARE @i INT = 1
DECLARE @imax INT = 1000000

WHILE @i <= @imax
BEGIN
INSERT INTO #T2
SELECT ROUND(RAND(@i) * 30000,0), NULL
SET @i+= 1
END
UPDATE T2 SET POSNR = POSNR_SOLL
FROM #T2 AS T2
JOIN
(
SELECT ID, POSNR_SOLL = CAST(ROW_NUMBER() OVER (PARTITION BY VBELN ORDER BY VBELN) AS nvarchar(10))
FROM #T2 AS T2
) AS SOLL
ON
SOLL.ID = T2.ID

'

/*************** MAIN TASK ***********/

SET @query +=
'
SELECT
SING.VBELN,
C_POSNR = CONCAT(
'
-- one MAX per matrix column; from column 2 on a comma is prepended as delimiter
SET @i = 1
WHILE @i <= @i_max
BEGIN
IF @i = 1 SET @query += 'MAX(T'+CAST(@i AS NVARCHAR(10))+'+POSNR)'
ELSE SET @query += ',MAX('',''+T'+CAST(@i AS NVARCHAR(10))+'+POSNR)'
SET @i += 1
END
SET @query += '
)

FROM
(
SELECT *, RN = ROW_NUMBER() OVER (PARTITION BY VBELN ORDER BY POSNR) FROM #T2
) AS SING
JOIN #T1 AS PIV
ON
PIV.ROWID = SING.RN
GROUP BY
SING.VBELN
ORDER BY
SING.VBELN
'

SET @query +=
'

DROP TABLE #T1
DROP TABLE #T2
'

SELECT @query
EXEC (@query)

by Hans Klüser

SQL Server 2012 | SQL Server 2014 | PDW/APS 2012

Recently we needed to calculate something like a ‘last non empty’ value in a T-SQL query. This blog post is about the solution we ended up with as an alternative to the classic pattern involving sub-queries.

To illustrate the task let’s first look at some data:

The extract shows a contract table with some gaps. The task is to fill the gaps with the last non empty contract of the same id. So, here is the final result:

As you can see, apart from the first days for id 2 which don’t have a last value, all gaps have been filled.

In order to fill the gaps using T-SQL window functions, the idea is to calculate the number of steps we need to go back for each null value to catch the corresponding last value. In the following screenshot, I’m showing this value as the last column:

For example, for ID 1, Date 2014-07-17 we have to go two rows back (2014-07-15) to get the last non empty value. For the first two dates for ID 2 we also have a lag-value, however there is no corresponding row. Looking at the lag columns it somewhat looks like a row_number over rows with a contract value of NULL. Actually, looking at ID 2 there may be more than one gap (NULL values) so it’s more like a row number over groups of contracts. To determine those groups we need to find changes in the contract over time. So let’s start with this first.

with
C1 as
(select ID, Date, Contract
, iif(isnull(Contract,'') <> isnull(lag(Contract,1) over (partition by ID order by Date),''),1,0) ContractChange
from [dbo].[Contracts])

select * from C1 order by ID, Date

Using the lag window-function I added a new column ‘ContractChange’ that gives 1 whenever the contract changes and 0 otherwise. The next step is to calculate a running total of the column to build up groups of contracts:

with
C1 as…
C2 as
(select ID, Date, Contract, ContractChange,
sum(ContractChange) over (partition by id order by Date) ContractGroup
from C1)

select * from C2 order by ID, Date

The new column ‘ContractGroup’ now calculates a value that increments whenever the contract changes. We can now calculate a row_number using the ContractGroup column as the partition:

with
C1 as…
C2 as…
C3 as
(select ID, Date, Contract, ContractChange, ContractGroup,
row_number() over (partition by id, ContractGroup order by Date) LastContractLag
from C2)

select * from C3 order by ID, Date

And actually, the LastContractLag column here is already the value we need for the lag-function to get to the non-empty value. So here is the final query (including the intermediate calculations from above):

with
C1 as
(select ID, Date, Contract
, iif(isnull(Contract,'') <> isnull(lag(Contract,1) over (partition by ID order by Date),''),1,0) ContractChange
from [dbo].[Contracts])
,
C2 as
(select ID, Date, Contract, ContractChange,
sum(ContractChange) over (partition by id order by Date) ContractGroup
from C1)
,
C3 as
(select ID, Date, Contract, ContractChange, ContractGroup,
row_number() over (partition by id, ContractGroup order by Date) LastContractLag
from C2)

select ID, Date, Contract
,iif(Contract is null, lag(Contract,LastContractLag) over (partition by id order by Date),Contract) ContractLastNonEmpty
from C3
order by ID, Date

The output of this query is shown above (final result). And again this is a good example of the power of window functions.
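The three steps can also be traced row by row outside the database. The following Python sketch mirrors the logic of C1 to C3 and the final lag; the function name last_non_empty and the sample contracts K1 to K4 are assumptions made for illustration and are not the data from the screenshots.

```python
from itertools import groupby


def last_non_empty(rows):
    """rows: list of (id, date, contract) tuples; contract may be None.

    Mirrors the steps above: flag changes, start a new group on each change,
    number the rows inside each group, then look that many rows back.
    """
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, part in groupby(ordered, key=lambda r: r[0]):
        part = list(part)
        lag_steps = 0
        previous = None
        for i, (id_, date, contract) in enumerate(part):
            change = (contract or "") != (previous or "")  # ContractChange
            if change:
                lag_steps = 0  # a new ContractGroup starts here
            lag_steps += 1  # LastContractLag (row_number within the group)
            filled = contract
            if contract is None and i - lag_steps >= 0:
                filled = part[i - lag_steps][2]  # lag(Contract, LastContractLag)
            result.append((id_, date, filled))
            previous = contract
    return result


contracts = [
    (1, "2014-07-15", "K1"), (1, "2014-07-16", None),
    (1, "2014-07-17", None), (1, "2014-07-18", "K2"),
    (2, "2014-07-15", None), (2, "2014-07-16", None),
    (2, "2014-07-17", "K3"), (2, "2014-07-18", None),
    (2, "2014-07-19", "K4"), (2, "2014-07-20", None),
]
filled = [c for _, _, c in last_non_empty(contracts)]
```

The leading gaps of id 2 stay empty because looking back would leave the partition, which matches the behaviour of lag when no corresponding row exists.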

Conclusion

In our situation, this solution performed much better than a sub-query approach, but depending on the table layout and the amount of data, other approaches may still be better, so you may want to try different patterns for solving this problem.

by Hilmar Buchta

SQL Server 2012 | SQL Server 2014

Microsoft’s Analytics Platform System (APS) offers built-in transparent access to Hadoop data sources through the Polybase technology. This includes bidirectional access not only to Hadoop but also to cloud services. The SMP SQL Server currently doesn’t contain Polybase, so access to Hadoop needs to be handled differently. Will Polybase be available in an upcoming SMP SQL Server? In the past we saw some technology making its way from PDW to SMP SQL Server, for example the clustered columnstore index, the cardinality estimation or the batch mode table operations. So let’s hope that Polybase makes it into the SMP SQL Server soon. Until then, one option is to use the Hortonworks ODBC driver and linked tables. To be honest, Polybase is a much more powerful technology since it uses cost-based cross-platform query optimization, which includes the ability to push down tasks to the Hadoop cluster when it makes sense. Also, Polybase doesn’t rely on Hive but accesses the files directly in parallel, which gives great performance. Linked tables are less powerful but may still be useful in some cases.

So, here we go. First, you need to download the ODBC driver from the Hortonworks add-ons page: http://hortonworks.com/hdp/addons/.

Make sure you pick the right version (32 bit/64 bit) for your operating system. After the installation completes, we need to set up an ODBC connection. To do so, start the ODBC Data Source Administrator (Windows+S, then type ‘ODBC’). Again, make sure to start the correct version (32 bit/64 bit). The installer has already created a connection but you still need to supply the connection properties. I created a new connection instead:

I’m connecting to the Hortonworks Sandbox here (HDP 2.1, I had problems connecting to HDP 2.2 with the current version of the ODBC driver). Instead of the host name you can also enter the IP address (usually 127.0.0.1 for the sandbox) but in order to get other tools running (like Redgate Hdfs Explorer) I configured the sandbox virtual machine to run on a bridged network and put the bridge network IP address of the sandbox (console command “ip addr”) in my local host file.

You should now click on Test to verify that the connection actually works:

In SQL Server Management Studio we can now create a linked server connection to the Hadoop system using the following command:

EXEC master.dbo.sp_addlinkedserver
@server = N'Hadoop',
@srvproduct=N'HIVE',
@provider=N'MSDASQL',
@datasrc=N'HDP',
@provstr=N'Provider=MSDASQL.1;Persist Security Info=True;User ID=hue;'

Depending on your Hadoop security settings, you might need to provide a password in the provider string as well. The @server name is used to refer to the linked server later, while @datasrc names the ODBC connection (see “Data Source Name” in the configuration dialog of the connection above).

With the new linked server, we can now explore the Hive database in Management Studio:

To run a query on, for example, the table “sample_07”, you can use one of the following commands:

select * from openquery (Hadoop, 'select * from Sample_07')

select * from [Hadoop].[HIVE].[default].[sample_07]

For both queries, “Hadoop” refers to the name of the linked server (@server parameter in the SQL statement from above).

If you get the following error message, this means that you are not allowed to query the table:

OLE DB provider "MSDASQL" for linked server "Hadoop" returned message "[Hortonworks][HiveODBC] (35) Error from Hive: error code: ‘40000’ error message: ‘Error while compiling statement: FAILED: HiveAccessControlException Permission denied. Principal [name=hue, type=USER] does not have following privileges on Object [type=TABLE_OR_VIEW, name=default.sample_07] : [SELECT]’.".
Msg 7306, Level 16, State 2, Line 1
Cannot open the table ""HIVE"."default"."sample_08"" from OLE DB provider "MSDASQL" for linked server "Hadoop".

In this case, you should simply grant the user from your ODBC connection the SELECT right. To do so, run the following query in Hive:

grant select on sample_07 to user hue;

That’s it. You should now get the contents of the table in SQL Server:

You might want to set the length of string columns manually because Hive does not return the size of a string column (in Hive, the column type is simply “string”). The size reported for the query results comes from the advanced ODBC settings of our connection. I left everything at the defaults, so here is how it looks:

So, the default string column length is 255 here. Let’s check and copy the data over to SQL Server:

select * into sample_07 from [Hadoop].[HIVE].[default].[sample_07]

The resulting table looks like this:

For more precise control over the column length, use the convert function, for example:

select
convert(nvarchar(50),[code]) [code],
convert(nvarchar(80),[description]) [description],
total_emp,
salary
from [Hadoop].[HIVE].[default].[sample_07]

Be careful with the remaining settings in the advanced options dialog. For example, checking “Use native query” means that your query (openquery syntax) is passed to Hive as it is. This may be intended, in order to fully leverage specific features of Hive, but it can also lead to errors if you’re not familiar with the HiveQL query syntax. Also, to get better performance with larger tables, you might want to set the “Rows fetched per block” option to a larger value.

With HDP 2.2 you should also be able to write to the table (create a new table, grant all permissions and run an insert into), but I couldn’t get this to work on my HDP 2.1 machine.

Summary

Until PolyBase makes it into the SMP SQL Server product, Hadoop data can be queried from SQL Server using the ODBC driver and a linked server object. This could also be an option for connecting Analysis Services to Hadoop, by using SQL Server views on top of the linked server, since Analysis Services doesn’t support ODBC in multidimensional mode. However, PolyBase on the APS gives much better performance because of its intelligent cross-platform query optimizer, and PolyBase can also be used to write data to Hadoop, so I hope we’ll find this technology in the SMP SQL Server soon.

by Hilmar Buchta

Quite often OLAP users report performance problems from the past like “last Wednesday between 14:00 and 17:00”.

For retrospective OLAP performance analysis we need not only the monitoring information and query traces but also the complete picture of events that could have an influence on the system performance.

That is why we strongly recommend making the logging of such events part of your production system policies.

In the simplest case this can be a plain Excel file placed on a network share. The best way, however, is to have this information integrated with your monitoring system. This can dramatically increase the speed and ease of your analysis.

You can start by logging the following events and add other categories that are relevant:

Change of aggregation design
Inactive aggregations detected (unassigned or empty)
Change of the configuration or version of the OLAP-Engine
Change in the software (not OLAP-Engine) or hardware configuration
Deployment of a new cube version
System maintenance
Unplanned cube processing
Any non-regular OLAP activities (performance tests, heavy one time queries)

It is also recommended to have at least the following attributes in your log:

Time or time span
Relevant object (system, instance, OLAP object, etc.)
Planned/unplanned
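
As a minimal sketch of what such a log could look like once it is kept as structured records rather than a spreadsheet, the following Python snippet stores the attributes listed above and finds all logged events that overlap a reported time span. The names OlapEvent and events_overlapping, the field names and the sample entries are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OlapEvent:
    """One logged event (hypothetical structure, not a product API)."""
    start: datetime          # time or start of the time span
    end: datetime            # end of the time span (same as start for a point in time)
    category: str            # e.g. "Unplanned cube processing"
    relevant_object: str     # system, instance, OLAP object, etc.
    planned: bool            # planned/unplanned


def events_overlapping(events, window_start, window_end):
    """Return all events whose time span overlaps the given window."""
    return [e for e in events if e.start <= window_end and e.end >= window_start]


# Made-up sample entries for illustration only.
log = [
    OlapEvent(datetime(2015, 3, 4, 13, 30), datetime(2015, 3, 4, 15, 0),
              "Unplanned cube processing", "Sales cube", planned=False),
    OlapEvent(datetime(2015, 3, 5, 2, 0), datetime(2015, 3, 5, 3, 0),
              "System maintenance", "OLAP server", planned=True),
]

# A user reports slow queries between 14:00 and 17:00 on 4 March.
hits = events_overlapping(log, datetime(2015, 3, 4, 14, 0), datetime(2015, 3, 4, 17, 0))
for event in hits:
    print(event.category, "on", event.relevant_object)
```

Keeping start and end as separate fields lets a single overlap check cover both point-in-time events and longer time spans.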

User feedback

Ideally, users should also have a channel for easily reporting performance issues. Technically this can be a simple web frontend referenced directly from the OLAP client using cube actions. At least the following information should be logged:

User reporting the issue
Time or time span of issue observed
Report, cube or other info localizing the issue
Severity or priority

We recommend having this part of the OLAP infrastructure in place from the first system deployment, so that a complete history is available for analysis.

by Michael Mukovskiy

MDX is the de facto standard for querying multidimensional databases. Yet despite its syntactic similarity to SQL, many people find it a closed book.

In fact, MDX is not difficult at all. Once you have internalized the basic concepts and ideas, it becomes easy to formulate even very complex queries and calculations in MDX. To get there, the training course not only conveys the necessary theory but also includes numerous hands-on exercises, some of them already fairly demanding, that participants can work through directly.

Topics at a glance:

Overview and basic concepts
Calculations in the query
Creating MDX queries for Reporting Services
Defining calculations in the cube
Useful tools, literature and outlook
Numerous hands-on exercises

Organizer: TDWI Germany e.V. and SIGS DATACOM

Speaker: Christina Bräutigam, Consultant
The mathematics graduate (Diplom-Mathematikerin) has been with ORAYLIS since January 2012 and has specialized in MDX, among other areas. Working closely with Hilmar Buchta (managing director of ORAYLIS GmbH and one of the world’s most renowned MDX experts), she learned his training methods. She regularly gives training courses on this topic and is happy to address the individual wishes and requirements of the participants.

June 2015

Location: Düsseldorf
Date: 11.06.2015 – 12.06.2015
Time: 09:00 – 17:30

Dates, details & registration: TDWI Seminar MDX Intensiv

by Emilija Sila

These days every company communicates figures in the form of dashboards, statistics and reports. Look at a range of these and, even within a single company, you will come across the most varied colors, shapes and labels for what are mostly the same facts. As part of my certification as a HICHERT®IBCS Certified Consultant (HCC), I had the opportunity to take a closer look at the topic of notation. In this new blog series I will discuss the importance of binding standards for reporting. One aim of the series is to clear up some of the resentment towards notation in reporting; these “rules” should instead be seen as an opportunity for the company.

In the first part we clarify the benefits of a uniform notation for your own company and beyond. A further part deals with the IBCS notation standards in connection with the Hichert® SUCCESS rules. This is followed by practical suggestions on how to derive a notation for a company from them. So let’s start with part 1.

Standards in report creation – Part 1: Benefits of a global notation

Notations are used in many professional fields today. In music, the notation system still common today has been in use since the 11th century. Any musician who can read music can read pitch, tempo and other information from this five-line system.

Fig. 1

It does not matter whether the music is classical, such as Mozart or Beethoven, or jazz, Schlager, rock, pop or anything else.

Electrical circuit diagrams are another example. Everyone will probably remember these symbols from physics lessons:

Fig. 2

Most people will recognize a power source connected to a load, with a switch in between.

Symbols are used in architecture as well:

Fig. 3

Whether the plans were drawn with a CAD program or by hand, the same symbols have been in use here for decades. By contrast, anyone who takes the trouble to compare the revenue figures in the annual reports of the DAX 30 companies will find a colorful potpourri of different ways of presenting numbers:

Fig. 4

Revenue figures over time are shown sometimes horizontally ascending, sometimes vertically descending, then horizontally descending again, then with thick columns or narrow bars, then with staggered bars, and so on. The charts mostly follow the companies’ various corporate designs and consist of lots of color and differing symbols. They contribute only to a limited extent to clarity and comparability. In most corporate annual reports, the charts and tables therefore also have to be explained with a lot of text.

But even within a single company you rarely find uniform report formats. Every department reports in its own way. In this (admittedly extreme) example, revenue figures from different departments of one company are delivered to Controlling:

Fig. 5

Fig. 6

Now a controller wants to compare the two departments and communicate the result to management. With the existing charts this is hardly possible. There will be no alternative but to invest time and visualize the numbers all over again. If a uniform notation were in place, a clear picture of the two departments’ revenues in comparison would emerge relatively quickly:

Fig. 7

The following notation rules would be conceivable for this chart:

– Department revenues are always scaled identically on the vertical axis

– Where there is a time reference, time is always on the horizontal axis and sorted in ascending order

– Previous years are shown in gray, the current year in black and forecast figures with black-and-white hatching

– Figures are always stated in millions of euros

– The figures are shown above the columns

– The column width is the one defined for values; volumes are shown with narrower columns

– Gridlines on the axes are not permitted

– The name of the department appears above the chart

– A vertical line visually separates actual figures from forecast or plan figures

In addition, font and font size, the spacing between columns, the color of the axes and so on can be specified. At first this sounds like patronizing the report authors, but on closer inspection it is more of a help than a hindrance. The example above shows a major advantage of a uniform notation: first and foremost, the presentation of numbers should serve to grasp their message visually and, ideally, to derive decisions from it. If meanings in these presentations are standardized, the viewer can focus on the essentials more quickly and does not first have to ask what the author is trying to say.

If notable points are then also visually emphasized through so-called “highlighting”, you can speak of a report that, in the truest sense of the word, reports something. More on highlighting in a later post on this topic.

Now imagine that the companies in the “chart salad” of Fig. 4 presented their figures with roughly the same symbols: after the second chart at the latest, the viewer would understand what the remaining charts are about. Annual reports would be as easy to read as sheet music or circuit diagrams. In addition, the explanatory texts could be much shorter than they are today.

These ideas on notation for business communication are pursued by the International Business Communication Standards (ibcs®). In part 2 of this blog series we will introduce the association and the standards created so far in more detail.

by Arno Cebulla

For more than 10 years, Dr. Rolf Hichert has been working on the visualization of management information and how to improve it. The SUCCESS rules he compiled form the guideline for understandable, effective and efficient reports and presentations. Dr. Hichert conveys the content of these rules in numerous talks and seminars.

Fig. 1

A good 2 years ago, the association ibcs® (International Business Communication Standards) was founded on Dr. Hichert’s initiative. This association under Swiss law has set itself the goal of “promoting the level of understanding in written documents for business communication”. The SUCCESS rules form the basis of these standards. IBCS develops the rules further and puts them up for public discussion on the association’s own website.

These standards are now available in a draft version. They are divided into three blocks, to which the well-known SUCCESS rules are assigned:

1. Conceptual rules

a. SAY

b. STRUCTURE

These rules are based, among other things, on the Pyramid Principle by Barbara Minto (The Pyramid Principle, 3rd edition, 2002).

2. Perceptual rules

a. EXPRESS

b. SIMPLIFY

c. CONDENSE

d. CHECK

These rules are primarily about perception. Their content draws, among other sources, on the publications of Edward Tufte (The Visual Display of Quantitative Information, 2nd edition, 2011) and Stephen Few (Show Me the Numbers, 2nd edition, 2012).

3. Semantic rules

a. UNIFY

Numerous publications already exist on the foundations of the conceptual and perceptual rules. What has been missing so far is a description of the semantic rules, that is, the standardization of meanings. This is exactly where the basic idea of the ibcs comes in: creating an internationally valid rule set for standardizing business communication.

To achieve this standardization of meanings, 10 rules have already been put up for public discussion and the results debated.

Fig. 2

The particular charm is that these rules are not laid down by some committee for everyone else to follow. Because the proposals are discussed in public, anyone who is interested can take part in the discussion. This increases the acceptance of the rules, because a rule set only makes sense if it is actually followed. Of course, the rules cannot be adopted 1:1 in every company, but they offer a guideline to work from.

In practice, we increasingly encounter the requirement in projects to create reports according to these standards. The basis for this is a comprehensive style guide that describes the business communication notation for the company. We will look at this style guide in part 3 of this blog series.

by Arno Cebulla

The Data Vault model is known for its flexibility and its ability to scale out. One of the reasons the model is so scalable is the way relationships are realized. Regardless of how a relationship is modelled in the source system, it becomes a many-to-many table in the Data Vault model. These many-to-many tables are called Links. Links contain the surrogate keys of two or more Hub tables and store a distinct list of the key combinations of the relationship they represent.

In today’s post I chose the source table SalesOrderDetail, modelled as the Link table SalesOrderDetail_Lnk. SalesOrderDetail from Adventure Works 2012 has a relationship to Product and to SalesOrder. The Link table SalesOrderDetail_Lnk references Product_Hub and SalesOrder_Hub, so the foreign keys are Product_Seq and SalesOrder_Seq.

Figure: Data Vault 1.0 Link SalesOrderDetail

Like a Hub table, a Link table needs some metadata columns, so I added the columns LoadTimestamp, LoadProcess and RecordSource to the table. These columns are described in more detail in my article about Hub loads.

Data Vault 1.0 Link Load

Figure: Data Vault 1.0 Link load pattern

To load a Data Vault 1.0 Link table, a distinct list of business keys has to be selected from the stage; more precisely, a distinct list of business key combinations. For each business key in a combination, a lookup is made to get the surrogate key from the related Hub table. Business key combinations that already exist in the Link table are dropped from the data flow. Before a new row is inserted into the Link table, a surrogate sequence key is added to the row.
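
The steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the stored procedure discussed later in this post: it uses Python with an in-memory SQLite database instead of SQL Server, AUTOINCREMENT stands in for the sequence used as a default constraint, and the business key columns ProductID and SalesOrderID, the helper load_link and the sample rows are assumed for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product_Hub    (Product_Seq    INTEGER PRIMARY KEY, ProductID    INTEGER UNIQUE);
CREATE TABLE SalesOrder_Hub (SalesOrder_Seq INTEGER PRIMARY KEY, SalesOrderID INTEGER UNIQUE);
CREATE TABLE SalesOrderDetail_Lnk (
    SalesOrderDetail_Seq INTEGER PRIMARY KEY AUTOINCREMENT, -- stand-in for the sequence default
    Product_Seq    INTEGER NOT NULL,
    SalesOrder_Seq INTEGER NOT NULL,
    LoadTimestamp  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    LoadProcess    INTEGER NOT NULL,
    RecordSource   TEXT    NOT NULL,
    UNIQUE (Product_Seq, SalesOrder_Seq)
);
CREATE TABLE SalesOrderDetail_Sales_AdventureWorks2012 (
    SalesOrderID INTEGER, ProductID INTEGER, OrderQty INTEGER
);

-- Made-up sample data: two hubs already loaded, four staged detail rows.
INSERT INTO Product_Hub    VALUES (1, 707), (2, 708);
INSERT INTO SalesOrder_Hub VALUES (10, 43659), (11, 43660);
INSERT INTO SalesOrderDetail_Sales_AdventureWorks2012
VALUES (43659, 707, 1), (43659, 707, 3), (43659, 708, 2), (43660, 708, 1);
""")

LOAD_LINK = """
INSERT INTO SalesOrderDetail_Lnk (Product_Seq, SalesOrder_Seq, LoadProcess, RecordSource)
SELECT DISTINCT p.Product_Seq, s.SalesOrder_Seq, :load_process, 'AdventureWorks2012'
FROM SalesOrderDetail_Sales_AdventureWorks2012 AS st
JOIN Product_Hub    AS p ON p.ProductID    = st.ProductID     -- lookup of hub surrogate key
JOIN SalesOrder_Hub AS s ON s.SalesOrderID = st.SalesOrderID  -- lookup of hub surrogate key
WHERE NOT EXISTS (                                            -- drop combinations already in the link
    SELECT 1 FROM SalesOrderDetail_Lnk AS l
    WHERE l.Product_Seq = p.Product_Seq AND l.SalesOrder_Seq = s.SalesOrder_Seq
)
"""


def load_link(connection, load_process):
    """Run one link load and return the number of newly inserted link rows."""
    cursor = connection.execute(LOAD_LINK, {"load_process": load_process})
    connection.commit()
    return cursor.rowcount


first_run = load_link(conn, load_process=1)   # inserts the distinct key combinations
second_run = load_link(conn, load_process=2)  # same stage data again: nothing new to insert
print(first_run, second_run)
```

Because existing combinations are filtered out before the insert, the load can be repeated on the same stage data without creating duplicate link rows.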

Example

A SQL Server Integration Services package loads the SalesOrderDetail data into the stage table SalesOrderDetail_Sales_AdventureWorks2012.

SalesOrderDetail (Source) >> SalesOrderDetail_Sales_AdventureWorks2012 (Stage)

The Link load pattern is used to load the data into the table SalesOrderDetail_Lnk.

SalesOrderDetail_Sales_AdventureWorks2012 (Stage) >> SalesOrderDetail_Lnk (Destination)

T-SQL – Implementation

The stored procedure Load_SalesOrderDetail_Link implements the Data Vault 1.0 Link load pattern. Here I use a sequence as a default constraint to generate the surrogate key for the Link table. The parameter @LoadProcess has to be set by the ETL process that executes the procedure. It can be generated by any tool or scheduler that runs your ETL workflow. If you use an SSIS package to execute your stored procedure, I recommend using the ServerExecutionID within SSIS.

Figure: Data Vault 1.0 Link load, T-SQL implementation

SQL Server Integration Services – Implementation

The implementation in SQL Server Integration Services shows the simplicity of this load pattern more visually. As in the T-SQL example above, the surrogate key is generated by a sequence used as a default constraint. This method is described in more detail in my previous post about Hub loads.

Figure: Data Vault 1.0 Link load, SSIS implementation

Conclusion

These implementations are examples that illustrate the pattern for loading a Data Vault 1.0 Link table. Individual project conditions will lead to individual implementations of the pattern. The examples show how simply this pattern can be realized. The benefit of simple patterns is that these processes can be automated: standardized, simple and repeatable patterns make it possible to generate ETL code.

Together with the Hub tables, Link tables form the skeleton of the Data Vault model. The use of many-to-many tables makes the Data Vault model very flexible. When the model is extended, there is almost no refactoring effort. These abilities make Enterprise Data Warehouse projects more agile and flexible.

by Daniel Piatkowski

Having looked at the theoretical aspects of the standards in part 1 and part 2, this part deals with putting them into practice in a style guide.

Basics

The style guide serves as a guideline for reports created as part of corporate communication. It is advisable to implement this guide as a standalone specification. Existing instruments such as corporate designs serve the external presentation of the company. Corporate communication, however, is about informing quickly and supporting decisions. It therefore makes little sense, for example, for the author of an internal report to worry about the correct placement of the company logo on the document, especially since the recipient can be assumed to know which company both author and recipient work for. Moreover, the logo has no relevance to the report’s message and may take up valuable space needed for more relevant elements.

When specifying colors, you should likewise break away from existing rules for external presentation, because color is a very good way of reinforcing a message. It is self-explanatory, for example, that positive numbers are shown in green and negative numbers in red. But if the corporate design allows only shades of green or red because of the logo color, this logical concept for presenting numbers can no longer be applied.

Fig. 1

The color blue also has a special meaning. It is used for so-called highlighting, which reinforces the report’s message and directs the recipient’s eye to the focus of that message:

Fig. 2

Content

Experience shows that it is advisable to divide the style guide into a theoretical and a practical part.

Besides how to use the guide, the theoretical part explains in general terms the criteria on which the specifications are based. These descriptions allow both new and long-standing employees to read up on the subject at any time. They also help in reviewing the validity of the style guide and adapting it where necessary. More on this later in this part.

Fig. 3

The practical part describes the report elements chart, table and text and how they are used. The specifications follow the ibcs® notation rules. When selecting them, particular care must be taken that the rules can be implemented with the technical means available. This also determines how detailed the element specifications are, which can go as far as millimeter values for lines and spacing.

Fig. 4

Since the document deals with visualization, the rules can of course be conveyed in a visual way.

The result

The end result is a detailed style guide that, based on the ibcs notation rules, is intended to enable fast corporate communication.

It is important that the content is accepted by both authors and recipients. To achieve this, the requirements of both parties should be determined before work begins, and these employees should also be involved in the creation process. In practice, forming a “style guide” committee has proven to be a good approach. This committee decides on the entries in the style guide. Even after completion, the group meets once or twice a year to check that the guide is up to date and to decide on any new entries or changes. This keeps the style guide current.

Of course, creating the style guide involves a fair amount of work at first. In return, the document makes everyday work considerably easier for both the authors and the recipients of the reports.

ORAYLIS GmbH supports you in creating this document. Closely following these principles, we work with you to create a style guide for your company that contains and describes templates for all relevant use cases. With this support, you can quickly and easily put even complex information into charts and tables that every viewer can readily follow and will remember in the long term.

Part 4 of this blog series deals with the concrete implementation of a report and the question of how to analyze existing reports and then rework them according to the notation rules.

by Arno Cebulla