Implementing a phonetic search with the Cologne phonetics algorithm (Kölner Phonetik)

January 28, 2014, 5:17 am

A recurring problem with large volumes of data is finding and eliminating potentially redundant records, for example duplicates in a customer base. The difficulty is that such duplicates do not necessarily share exactly the same spelling. Thomas Müller, Musterstrasse 12 would, for instance, be the same customer as Tomas Mueller, Musterstraße 12.

There are several approaches to this problem. All of them make strings with different spellings comparable. One approach is fuzzy string search, which uses various string-matching algorithms to find a string within a longer string or text. The Levenshtein distance plays an important role here: it is the minimum number of insert, delete and replace operations needed to turn the first string into the second (see http://www.levenshtein.de).
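To make the Levenshtein distance concrete, here is a minimal Python sketch of the classic dynamic-programming calculation. It is not part of the original post; the function name `levenshtein` is chosen for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and replacements turning a into b."""
    # previous[j] is the distance between the already processed prefix of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into an empty string takes i deletes
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace char_a with char_b (or keep it)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("thomas", "tomas"))     # 1: delete the "h"
    print(levenshtein("müller", "mueller"))   # 2: replace "ü" with "u", insert "e"
```

A small distance relative to the string length marks two spellings as candidates for the same record.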

Another way to compare strings is phonetic search, which tries to derive a representation of how each string sounds and then compares those representations.

Several methods exist, and they differ in their approach. SQL Server already ships the Soundex algorithm as a built-in function, which reduces a string of any length to a four-character alphanumeric code. The method was developed by Russell in the USA in the early 20th century and is therefore optimized for English.

Cologne phonetics (Kölner Phonetik) is better suited to German. It maps each letter of a word to a digit between 0 and 8 according to a fixed set of rules. Unlike Soundex, the length of the phonetic code is not limited (see http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik).

A possible implementation of Cologne phonetics as a function for SQL Server 2012 is presented here.

CREATE FUNCTION [dbo].[SOUNDEX_GER] (@strWord NVARCHAR(1000))
RETURNS NVARCHAR(1000) AS
BEGIN

    DECLARE
        @Word NVARCHAR(1000),
        @WordLen INT,
        @Code NVARCHAR(1000) = '',
        @PhoneticCode NVARCHAR(1000) = '',
        @index INT,
        @RegEx NVARCHAR(50),
        @previousCharval NVARCHAR(1) = '|',
        @Charval NVARCHAR(1);

    SET @Word = LOWER(@strWord);
    -- Empty input yields an empty code (the function returns a string, not a number)
    IF LEN(@Word) < 1
        RETURN '';

    -- Normalization:
    -- v->f, w->f, j->i, y->i, ä->a, ö->o, ü->u, é->e, è->e, ê->e, à->a, á->a, ç->c, ph->f, ß->ss

    SET @Word =
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@Word,
            'v','f'), 'w','f'), 'j','i'), 'y','i'), 'ä','a'),
            'ö','o'), 'ü','u'), 'é','e'), 'è','e'), 'ê','e'),
            'à','a'), 'á','a'), 'ç','c'), 'ph','f'), 'ß','ss');

    -- Remove digits and special characters

    SET @RegEx = '%[^a-z]%';
    WHILE PATINDEX(@RegEx, @Word) > 0
        SET @Word = STUFF(@Word, PATINDEX(@RegEx, @Word), 1, '');

    -- For strings of length 1, append a space so that the onset check can look at a second character.

    SET @WordLen = LEN(@Word);
    IF @WordLen = 1
        SET @Word += ' ';

    -- Special cases at the start of the word

    IF (SUBSTRING(@Word,1,1) = 'c')
    BEGIN
        -- An initial 'c' before a,h,k,l,o,q,r,u,x is coded 4, otherwise 8
        SET @Code =
            CASE
                WHEN SUBSTRING(@Word,2,1) IN ('a','h','k','l','o','q','r','u','x')
                    THEN '4'
                ELSE '8'
            END;
        SET @index = 2;
    END
    ELSE
        SET @index = 1;

    -- Encoding

    WHILE @index <= @WordLen
    BEGIN
        SET @Code =
            CASE
                WHEN SUBSTRING(@Word,@index,1) IN ('a','e','i','o','u')
                    THEN @Code + '0'
                WHEN SUBSTRING(@Word,@index,1) = 'b'
                    THEN @Code + '1'
                WHEN SUBSTRING(@Word,@index,1) = 'p'
                    THEN IIF(@index < @WordLen, IIF(SUBSTRING(@Word,@index+1,1) = 'h', @Code+'3', @Code+'1'), @Code+'1')
                WHEN SUBSTRING(@Word,@index,1) IN ('d','t')
                    THEN IIF(@index < @WordLen, IIF(SUBSTRING(@Word,@index+1,1) IN ('c','s','z'), @Code+'8', @Code+'2'), @Code+'2')
                WHEN SUBSTRING(@Word,@index,1) = 'f'
                    THEN @Code + '3'
                WHEN SUBSTRING(@Word,@index,1) IN ('g','k','q')
                    THEN @Code + '4'
                -- 'c' before a,h,k,o,q,u,x is coded 4, unless it follows s or z
                WHEN SUBSTRING(@Word,@index,1) = 'c'
                    THEN IIF(@index < @WordLen, IIF(SUBSTRING(@Word,@index+1,1) IN ('a','h','k','o','q','u','x'), IIF(SUBSTRING(@Word,@index-1,1) IN ('s','z'), @Code+'8', @Code+'4'), @Code+'8'), @Code+'8')
                -- 'x' after c, k or q is coded 8, otherwise 48 (the published rules list q here, not x)
                WHEN SUBSTRING(@Word,@index,1) = 'x'
                    THEN IIF(@index > 1, IIF(SUBSTRING(@Word,@index-1,1) IN ('c','k','q'), @Code+'8', @Code+'48'), @Code+'48')
                WHEN SUBSTRING(@Word,@index,1) = 'l'
                    THEN @Code + '5'
                WHEN SUBSTRING(@Word,@index,1) IN ('m','n')
                    THEN @Code + '6'
                WHEN SUBSTRING(@Word,@index,1) = 'r'
                    THEN @Code + '7'
                WHEN SUBSTRING(@Word,@index,1) IN ('s','z')
                    THEN @Code + '8'
                ELSE @Code  -- 'h' and anything else has no code of its own
            END;
        SET @index += 1;
    END

    -- Collapse runs of identical codes first, then remove the '0' codes.
    -- A '0' at the start of the code is kept.

    SET @index = 0;
    WHILE @index < LEN(@Code)
    BEGIN
        SET @Charval = SUBSTRING(@Code, @index+1, 1);
        IF @Charval <> @previousCharval
        BEGIN
            IF @Charval <> '0' OR @index = 0
            BEGIN
                SET @PhoneticCode += @Charval;
            END
        END
        SET @previousCharval = @Charval;
        SET @index += 1;
    END
    RETURN @PhoneticCode;

END;
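To cross-check the output of the T-SQL function outside the database, the same rules can be sketched in Python. This is an illustrative re-implementation based on the published Cologne phonetics rules, not the author's code; the name `cologne_phonetics` and the normalization table are assumptions that mirror the replacements used above.

```python
import re

# Same normalization as in the T-SQL function (umlauts, accents, ß)
_NORMALIZE = (("ä", "a"), ("ö", "o"), ("ü", "u"), ("ß", "ss"), ("é", "e"),
              ("è", "e"), ("ê", "e"), ("à", "a"), ("á", "a"), ("ç", "c"))

# Letters whose code does not depend on their neighbours
_SIMPLE = {"b": "1", "f": "3", "v": "3", "w": "3", "g": "4", "k": "4",
           "q": "4", "l": "5", "m": "6", "n": "6", "r": "7", "s": "8", "z": "8"}


def cologne_phonetics(word: str) -> str:
    text = word.lower()
    for source, target in _NORMALIZE:
        text = text.replace(source, target)
    text = re.sub(r"[^a-z]", "", text)  # drop digits and special characters

    raw = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in "aeijouy":
            raw.append("0")
        elif ch == "h":
            continue  # 'h' has no code of its own
        elif ch == "p":
            raw.append("3" if nxt == "h" else "1")
        elif ch in "dt":
            raw.append("8" if nxt in {"c", "s", "z"} else "2")
        elif ch == "c":
            if i == 0:
                hard = nxt in set("ahkloqrux")
            else:
                hard = prev not in {"s", "z"} and nxt in set("ahkoqux")
            raw.append("4" if hard else "8")
        elif ch == "x":
            raw.append("8" if prev in {"c", "k", "q"} else "48")
        else:
            raw.append(_SIMPLE[ch])

    # Collapse runs of identical digits, then drop every '0' except a leading one
    code = []
    previous = None
    for position, digit in enumerate("".join(raw)):
        if digit != previous and (digit != "0" or position == 0):
            code.append(digit)
        previous = digit
    return "".join(code)


if __name__ == "__main__":
    for name in ("Thomas", "Tomas", "Müller", "Mueller", "Müller-Lüdenscheidt"):
        print(name, cologne_phonetics(name))
```

Both spellings from the introduction map to the same codes: Thomas and Tomas give 268, Müller and Mueller give 657.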

One possible use of this function is improving data quality through the duplicate cleansing mentioned above. For example, the business department or a data steward can manually screen for duplicates as part of master data management.

by Jens Heidrich

Automatic handling of external memory pressure on a SQL-Server

February 11, 2014, 1:27 pm

Imagine the following: we have a SQL Server. It is neatly equipped with 32 GB of memory, a few cores and enough disk space. The SSIS service also runs on the system, with some small loader packages that update the database every night. At the beginning of the production lifecycle the database is small, just a few GB. Everything is fine. Time goes by…
After a few months the database is larger than the available memory. The SSIS load package takes twice as long because of data growth, and suddenly our monitoring tool reports low memory during the nightly loads or, even worse, the SSIS load fails with out-of-memory errors.
What if we could react to low memory automatically and inform the admin by mail, so she can deal with the issue calmly the next morning instead of panicking during the night because of a failed load?

In this post I’ll show you how to set up an automatic reaction to low available memory that gives the SSIS service enough memory to run the load.

To achieve this we need just four simple steps:

1. Step: We set MinServerMemory of the SQL Server to half of the system memory
2. Step: We set MaxServerMemory to (system memory – 2 GB), here 30 GB (keeping the operating system alive)
3. Step: We set up a SQL Server Agent job that reduces MaxServerMemory, which gives more memory to the OS and other services
4. Step: We implement an alert that responds by running the SQL Server Agent job and e-mailing the admin

The 4 steps in detail:

1. Step: In this case half of the server memory is 16 GB (16384 MB). This is the script to set it up:

EXEC sys.sp_configure N'show advanced options', N'1' RECONFIGURE WITH OVERRIDE
GO
EXEC sys.sp_configure N'min server memory (MB)', N'16384'
GO
RECONFIGURE WITH OVERRIDE
GO
EXEC sys.sp_configure N'show advanced options', N'0' RECONFIGURE WITH OVERRIDE
GO

2. Step: Set MaxServerMemory to 30 GB (30720 MB)

EXEC sys.sp_configure N'show advanced options', N'1' RECONFIGURE WITH OVERRIDE
GO
EXEC sys.sp_configure N'max server memory (MB)', N'30720'
GO
RECONFIGURE WITH OVERRIDE
GO
EXEC sys.sp_configure N'show advanced options', N'0' RECONFIGURE WITH OVERRIDE
GO

In both steps the third line needs to be adjusted to the correct value (in MB) if your server is equipped with more or less memory.

3. Step: Now let’s set up the SQL Server Agent job that adjusts MaxServerMemory.

It’s a simple job with only one step. The step does the following: it reads the current min and max server memory settings, reduces max server memory by 512 MB, checks whether that would fall below the min server memory setting (if so, it sets the value to 1 MB above the minimum) and uses the script from step 2 to apply the setting:

DECLARE @currentMaxMem int;
DECLARE @currentMinMem int;
SELECT @currentMaxMem = CAST([value] as int) FROM [master].[sys].[configurations]
WHERE NAME IN ('max server memory (MB)')
SELECT @currentMinMem = CAST([value] as int) FROM [master].[sys].[configurations]
WHERE NAME IN ('min server memory (MB)')

-- Give 512 MB back to the operating system and other services
set @currentMaxMem = @currentMaxMem - 512

-- Never go below the configured minimum
if @currentMaxMem < @currentMinMem
begin
set @currentMaxMem = @currentMinMem + 1
end

EXEC sys.sp_configure N'show advanced options', N'1' RECONFIGURE WITH OVERRIDE

EXEC sys.sp_configure N'max server memory (MB)', @currentMaxMem RECONFIGURE WITH OVERRIDE

EXEC sys.sp_configure N'show advanced options', N'0' RECONFIGURE WITH OVERRIDE
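The arithmetic of the job step can be traced with a small Python sketch. It only models the calculation, it does not touch a server; the function names are made up for illustration and the numbers are the ones used in this post (30720 MB maximum, 16384 MB minimum, 512 MB per run).

```python
def next_max_memory(current_max_mb: int, min_mb: int, step_mb: int = 512) -> int:
    """One run of the job step: lower max server memory, but never below min + 1 MB."""
    candidate = current_max_mb - step_mb
    if candidate < min_mb:
        candidate = min_mb + 1
    return candidate


def simulate(start_max_mb: int, min_mb: int, runs: int) -> list:
    """Return the max server memory value after each of `runs` job executions."""
    values = [start_max_mb]
    for _ in range(runs):
        values.append(next_max_memory(values[-1], min_mb))
    return values


if __name__ == "__main__":
    history = simulate(start_max_mb=30720, min_mb=16384, runs=30)
    print(history[:3])    # first reductions
    print(history[-3:])   # behaviour once the minimum is reached
```

With these values the job can run 28 times before the maximum reaches the minimum; from then on the setting stays at 1 MB above the minimum, so repeated alerts cannot starve the SQL Server itself.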

The complete job, scripted, looks like this:

/****** Object: Job [AdjustMaxServerMemory] ******/

BEGIN TRANSACTION

DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
/****** Object: JobCategory ******/
IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0)
GOTO QuitWithRollback
END
DECLARE @jobId BINARY(16)
EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'AdjustMaxServerMemory',
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=N'sa',
@job_id = @jobId OUTPUT

IF (@@ERROR <> 0 OR @ReturnCode <> 0)
GOTO QuitWithRollback
/****** Object: Step [Adjust the memory] ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId,
@step_name=N'Adjust the memory',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0,
@subsystem=N'TSQL',
@command=N'DECLARE @currentMaxMem int;
DECLARE @currentMinMem int;
SELECT @currentMaxMem = CAST([value] as int) FROM [master].[sys].[configurations] WHERE NAME IN (''max server memory (MB)'')
SELECT @currentMinMem = CAST([value] as int) FROM [master].[sys].[configurations] WHERE NAME IN (''min server memory (MB)'')
set @currentMaxMem = @currentMaxMem - 512
if @currentMaxMem < @currentMinMem
begin set @currentMaxMem = @currentMinMem + 1
end
EXEC sys.sp_configure N''show advanced options'', N''1'' RECONFIGURE WITH OVERRIDE
EXEC sys.sp_configure N''max server memory (MB)'', @currentMaxMem
RECONFIGURE WITH OVERRIDE
EXEC sys.sp_configure N''show advanced options'', N''0'' RECONFIGURE WITH OVERRIDE
',
@database_name=N'master',
@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0)
GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0)
GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId,
@server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0)
GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO

4. Step: Now for the main attraction: monitoring the available memory with WMI and SQL Server Agent alerts

SQL Server Agent has a built-in feature called “Alerts”. With it we can set up monitoring for performance counter thresholds or WMI events. You can find the Alerts under the SQL Server Agent node in SQL Server Management Studio.

Right-click Alerts and choose “New Alert”. Let’s give it a useful name like “MaxServerMemory”. We want to be alerted when the available memory drops below 2 GB, as this indicates memory pressure on Windows and the other services. So we use a WMI event alert with the following query:

SELECT * FROM __InstanceModificationEvent WITHIN 300
WHERE TargetInstance ISA "Win32_PerfFormattedData_PerfOS_Memory" AND TargetInstance.AvailableMBytes < 2048

So, what are we doing? We use __InstanceModificationEvent with a polling interval of 300 seconds (WITHIN 300) to check the available memory in MB.
When the AvailableMBytes value falls below 2 GB (2048 MB), the event fires.
Now we need to configure the response to this alert. And what should it be? Correct, we start the job created in step 3:

And that’s it.

The alert checks the available memory every 300 seconds (5 minutes, feel free to adjust this to whatever you need), keeping in mind that Windows needs some memory and the SSIS load does, too. If we fall below the threshold, we start the job that reduces MaxServerMemory, and the SQL Server frees up some memory for other services.

Additionally we could e-mail someone to let them know the alert fired, so they can react to the issue the next morning. Either way, we make sure the SSIS load can run successfully.

By the way, setting up this reaction is independent of any monitoring tool as it just makes sure the SQL-Server gives some memory to the other services.

by Stefan Grigat

Parallel Data Warehouse (PDW) and ROLAP

February 16, 2014, 3:24 am

PDW 2012 | SQL Server 2012 | SQL Server 2014

This post is about using the Parallel Data Warehouse as a ROLAP source for SSAS. For PDW v1 this wasn’t recommended, but the bottom line of this post is that it works really well with PDW 2012. In fact, this is the first time I have seen MOLAP-like performance on large ROLAP tables (over a billion rows), which is another big plus for the PDW with the column store index. I’m really excited about this (and I’ll tell you why in a minute), but maybe I wasn’t loud enough. So here again:

“Using SSAS ROLAP with PDW 2012 is
working really well!!!”

But, and I have to lower my voice again, I have to agree with Chris Webb that there is almost no information about it out there. So enough reason to write about this truly amazing story.

Before going into some relevant topics, let me briefly recap the benefits of ROLAP over MOLAP:

LOW LATENCY: No need to process MOLAP partitions (data in the relational data warehouse tables is immediately available to the end users).
NO/LESS STORAGE REQUIRED: The ROLAP cube only contains the model, not the data. Therefore almost no disk space is required for storing the cube. It’s just the presentation of the model. Whether MOLAP or ROLAP is used is a technical implementation detail that is not visible to the end user. With both options, the end user gets an easy-to-use, highly interactive, quickly responding data model, which can be used from many tools including Excel pivot tables, Reporting Services, Tableau and other advanced analytical frontend tools.
LOWER PROJECT COSTS: No need to design and maintain partitions in the cube (see the remarks on partitioning below), which means less development and maintenance effort (for example for daily delta updates).
MORE FLEXIBLE: In MOLAP, many changes to a dimension require a full processing of the dimension, which switches all attached measure group partitions to the “unprocessed” state so that they need to be processed again. If you have a large cube, this process can take many hours. In ROLAP, none of this is necessary. Changes to cube dimensions are online immediately.
EASY DEPLOYMENT: Development, testing and deployment to production are much easier since the data is immediately available to end users.
SUPPORTS LARGE DIMENSIONS: Large dimensions (with many million rows) are difficult to handle for MOLAP SSAS. Processing takes a long time and query performance may go down. ROLAP works well with large dimensions.
SUPPORTS VERY LARGE FACT TABLES: MOLAP cube sizes of 4 or 5 TB are possible, and due to the compression in the cube storage this corresponds to fact table sizes of 50 TB and more. However, if you go beyond that, there is a point where only ROLAP cubes can handle the amount of data.

So there are many advantages when using ROLAP partitions in SSAS. However, there always was a big disadvantage:

BAD PERFORMANCE

Poor query performance for ROLAP partitions compared to MOLAP partitions.

Now, with the memory-optimized column store index, and especially with the parallel query engine of the PDW, you can get incredibly good query performance from ROLAP partitions. Therefore, we have to cross out this disadvantage:

~~BAD PERFORMANCE~~
GOOD PERFORMANCE

~~Poor query performance for ROLAP partitions compared to MOLAP partitions.~~
With column store index, ROLAP partitions are really fast

And since column store index is also available on SQL Server 2012 (non-clustered, read-only) and 2014 (clustered, updatable) this should also apply to the SMP SQL Server (I haven’t tested it out with huge amounts of data though).

Here are some remarks/recommendations if you’re planning ROLAP on PDW:

Clustered columnstore index

As mentioned above, the clustered column store index of the PDW is the key to using ROLAP on PDW 2012 and maybe the most important reason why ROLAP is now a reliable option on PDW at all. So make sure your (fact) tables are stored in clustered column store mode.

Fast network connection between PDW and Analysis Services

Obviously, a fast network connection between the SSAS server and the PDW is important for good performance. Of course this is also true for MOLAP or mixed environments. As of today, I would recommend adding the SSAS server to the appliance’s InfiniBand network.

Table layout: distributed/replicated

Most of the ROLAP queries will basically look like

select Dim1.Property1, Dim2.Property2, Sum(Fact1.Amount) SumOfAmount
from Fact1
inner join Dim1 on Fact1.Dim1Key=Dim1.Dim1Key
inner join Dim2 on Fact1.Dim2Key=Dim2.Dim2Key
group by Dim1.Property1, Dim2.Property2

For queries like this to respond well, the tables should be distribution-compatible. In many cases you can achieve this by turning the dimension tables into replicated tables. I have more detailed explanations of distribution and aggregation compatibility in some older posts, and there is also a good post by Stephan Köppen on this topic. An incompatible distribution when joining two large fact tables (for example a fact table with a many-to-many bridge table) results in shuffle move or even broadcast move operations. These are also fast, but not as lightning fast as you would expect for online analytical applications. So my recommendation is to choose the distribution keys carefully so that the required joins can be resolved locally. Aggregation compatibility is more difficult to achieve for all types of queries. However, in my experience so far PDW responded very fast even when the query was not aggregation-compatible.
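The idea behind distribution compatibility can be illustrated with a toy model in Python. This is not how the appliance is implemented: the modulo "hash", the number of distributions and all names are assumptions for illustration. The point is only that a fact table spread over several distributions can be joined and aggregated locally when every distribution has a full copy of the dimension.

```python
from collections import defaultdict

DISTRIBUTIONS = 8  # hypothetical number of distributions, for illustration only


def distribute(rows, key):
    """Assign each row to a distribution by its integer key (toy stand-in for a hash)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key] % DISTRIBUTIONS].append(row)
    return buckets


def local_join_sum(fact_buckets, dimension):
    """Join every fact bucket with a replicated dimension and aggregate, moving no rows."""
    totals = defaultdict(float)
    for bucket in fact_buckets.values():          # each bucket could be processed in parallel
        for row in bucket:
            color = dimension[row["ProductKey"]]  # replicated: a full copy is available everywhere
            totals[color] += row["SalesAmount"]
    return dict(totals)


if __name__ == "__main__":
    facts = [{"OrderKey": k, "ProductKey": k % 5, "SalesAmount": 10.0} for k in range(100)]
    dim_product = {0: "Red", 1: "Blue", 2: "Red", 3: "Black", 4: "Blue"}
    print(local_join_sum(distribute(facts, "OrderKey"), dim_product))
```

If the dimension were itself distributed on a different key, each bucket would first have to fetch the matching dimension rows from other distributions, which corresponds to the shuffle or broadcast move described above.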

Partitions

Having a large amount of data in MOLAP or ROLAP cubes usually requires partitioning of the measure groups. For MOLAP, recommendations vary from about 20 to 50 million rows per partition, so storing a billion rows results in at least 20 to 50 partitions. In practice you often end up with many more partitions in order to implement daily incremental loading. But because PDW is designed to run large queries, it’s much better to use only one partition instead of firing a scattershot of small queries at the appliance. Internally the PDW uses distributions to run the query on all cores in parallel, so there is no need to create partitions for performance reasons.
By the way, since many small queries require more computing power than a few complex queries, you should be careful with Excel’s pivot option “convert to formula”…

Distinct Count

As described in this blog post by Anthony Mattas (and in many other posts), you should set EnableRolapDistinctCountOnDataSource in the Analysis Services properties in order to compute the distinct count calculation on the database instead of fetching the distinct rows to Analysis Services.

Please note that this property is not yet available in the server properties but must be set manually in the msmdsrv.ini file (which can be found below the instance in the OLAP\Config sub directory).

Having all your table statistics up to date

This is generally very important when working with the PDW, not only when using ROLAP. While the compute nodes have auto create/auto update enabled, statistics are not (apart from very few cases) automatically created or updated on the control node. Without proper statistics, PDW cannot create an optimal distributed query plan. Put simply, in most cases where performance is an issue with PDW, incorrect statistics are the root cause.

Dealing with large dimensions

In some cases, having a measure group with many attached dimensions can cause problems if those dimensions are referenced in the query (on rows, columns, filter). I’m currently trying to narrow this down, but one possible reason could be the missing primary key constraints on the PDW together with large dimensions. Consider this simple query:

select P.Color, Sum(S.SalesAmount) SumOfSalesAmount
from [dbo].[FactInternetSales] S
inner join [dbo].[DimProduct] P on S.ProductKey=P.ProductKey
group by P.Color

If you have a primary key on dbo.DimProduct.ProductKey, the optimizer knows that the inner join cannot produce more rows than exist in the fact table, because for each row from the fact table we can find at most one row in the dimension table. Without the primary key (which is the situation in the PDW) the optimizer has to rely on density information from the statistics. This works pretty well, but let’s say that for a larger dimension the statistics give something like: “for each row from the fact table, you might be getting 1.3 rows from the dimension table”. So far, not much harm is done. But if you have many dimensions, the effect grows exponentially. With 8 dimensions and a 30% overestimate each you would end up at 1.3⁸ = 8.16. So instead of querying, for example, a billion rows, the optimizer thinks that we’re about to query 8 billion rows. This could have a huge effect on the query plan. If you encounter such issues, one option could be to convert the dimensions in the data source view to query binding. For example, the query for the product dimension may look like this:

select
ProductKey,
Min(Color) Color,
Min(EnglishProductName) EnglishProductName,
Min(ListPrice) ListPrice
…
from
DimProduct
group by ProductKey

Since ProductKey is actually a logical primary key, rewriting the dimension query this way gives the same result as

select ProductKey, Color, EnglishProductName, ListPrice from DimProduct

but because of the group by operation, the optimizer now knows for sure that ProductKey is unique, which gives a better estimate of the resulting rows.

Again, I’m still investigating these cases and the benefit of the query rewrite, but if you encounter performance issues, this may be one option to try.
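The compounding effect of the per-join overestimate described above is easy to reproduce. The following Python lines are only a worked calculation with the numbers from the example (1.3 rows per join, 8 dimensions, a billion fact rows); the function name is made up for illustration.

```python
def estimated_rows(fact_rows: int, rows_per_join: float, dimensions: int) -> float:
    """Row estimate after joining `dimensions` tables that each multiply the estimate."""
    return fact_rows * rows_per_join ** dimensions


if __name__ == "__main__":
    factor = 1.3 ** 8
    print(round(factor, 2))                        # 8.16
    print(estimated_rows(1_000_000_000, 1.3, 8))   # roughly 8.16 billion rows
    print(estimated_rows(1_000_000_000, 1.0, 8))   # with known uniqueness the estimate stays at a billion
```

Every dimension that is known to be unique on its key contributes a factor of exactly 1, which is why the group by rewrite can bring the estimate back down.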

Aggregation design and proactive caching

Since ROLAP aggregations rely on indexed views in the data source, you cannot use them on PDW 2012. However, judging from the query response times we have seen so far, there might not be much need for aggregations at all if your data is stored as a clustered columnstore index. If you need aggregations, you could try HOLAP aggregations. We haven’t tried this so far, but I’m planning to investigate further.

Proactive caching is currently only supported in polling mode (not with the trace mechanism).

Since we’re still in the process of adjusting our PDW ROLAP environment, I’m going to write more posts with tips and tricks, so stay tuned. Currently we’re investigating these topics:

How does ROLAP perform with role based security in the SSAS cube?
How does ROLAP perform with many users?
How does ROLAP work with more complicated MDX calculations involving PARALLELPERIOD, aggregates (AGGREGATE, SUM, MIN, MAX etc.) over dynamic sets etc.? Which MDX calculations perform better, and for which calculations should we still use MOLAP?

Also some interesting recommendations (for example regarding the “count of rows” aggregation or the proper setup of your Analysis Services server) can be found in the SQLCAT Whitepaper Analysis Services ROLAP for SQL Server Data Warehouses.

by Hilmar Buchta

Error while writing with DWLoader – Database Full

February 27, 2014, 3:34 am

It has been a while but I think this might be very useful information and something Microsoft should work on (yes, it has been addressed).

Anyway, back when I created the database for our production system I wasn’t really thinking about the size it would reach in the near future, and time went by without any problems. But at some point DWLoader gave us an error message which I couldn’t really interpret. Unfortunately it didn’t happen for all loading processes but only for some. Of course not being able to write to the production system was a huge problem, so everything possible was done to figure out where the problem was.

Below you find the actual dwloader statement with (part of) its row output and the error message.

dwloader.exe -U USER -P PASS -r "\r\n" -fh 2 -t 0x03 -rt value -rv 0 -e UTF8 -T [Test].[dbo].[TABLE] -i "load.csv" -R "reject.tmp" -E -M fastappend -m
[2014-02-27 11:49:31] Warning – Multiple transactions setting is set to true..
[2014-02-27 11:49:31] Starting Load
[2014-02-27 11:49:31] Load has started
[2014-02-27 11:49:31] Status: Running, Run Id: 61194 – Total Rows Processed: 0, Total Rows Rejected: 0
[2014-02-27 11:49:33] Status: Running, Run Id: 61194 – Total Rows Processed: 80736, Total Rows Rejected: 0
[2014-02-27 11:49:33] Status: Running, Run Id: 61194 – Total Rows Processed: 171564, Total Rows Rejected: 0
[2014-02-27 11:49:33] Status: Running, Run Id: 61194 – Total Rows Processed: 252300, Total Rows Rejected: 0
[2014-02-27 11:49:33] Status: Running, Run Id: 61194 – Total Rows Processed: 333036, Total Rows Rejected: 0
[2014-02-27 11:49:33] Status: Running, Run Id: 61194 – Total Rows Processed: 413772, Total Rows Rejected: 0
[2014-02-27 11:49:33] Status: Running, Run Id: 61194 – Total Rows Processed: 494508, Total Rows Rejected: 0
[2014-02-27 11:49:33] Status: Running, Run Id: 61194 – Total Rows Processed: 565152, Total Rows Rejected: 0
[2014-02-27 11:49:33] Status: Running, Run Id: 61194 – Total Rows Processed: 635796, Total Rows Rejected: 0
[2014-02-27 11:49:37] Status: Running, Run Id: 61194 – Total Rows Processed: 857820, Total Rows Rejected: 0
[2014-02-27 11:49:37] Status: Running, Run Id: 61194 – Total Rows Processed: 1069752, Total Rows Rejected: 0
[2014-02-27 11:49:37] Status: Running, Run Id: 61194 – Total Rows Processed: 1301868, Total Rows Rejected: 0
[2014-02-27 11:49:37] Status: Running, Run Id: 61194 – Total Rows Processed: 1513800, Total Rows Rejected: 0
[2014-02-27 11:49:37] Status: Running, Run Id: 61194 – Total Rows Processed: 1735824, Total Rows Rejected: 0
[2014-02-27 11:49:37] Status: Running, Run Id: 61194 – Total Rows Processed: 1947756, Total Rows Rejected: 0
[2014-02-27 11:49:37] Status: Running, Run Id: 61194 – Total Rows Processed: 2159688, Total Rows Rejected: 0
[2014-02-27 11:49:37] Status: Running, Run Id: 61194 – Total Rows Processed: 2391804, Total Rows Rejected: 0
[2014-02-27 11:49:43] Status: Running, Run Id: 61194 – Total Rows Processed: 2805576, Total Rows Rejected: 0
[2014-02-27 11:49:43] Status: Running, Run Id: 61194 – Total Rows Processed: 3209256, Total Rows Rejected: 0
…
…
…
[2014-02-27 11:50:05] Status: Running, Run Id: 61194 – Total Rows Processed: 21828996, Total Rows Rejected: 0
[2014-02-27 11:50:05] Status: Running, Run Id: 61194 – Total Rows Processed: 21909732, Total Rows Rejected: 0
[2014-02-27 11:50:05] Status: Running, Run Id: 61194 – Total Rows Processed: 21990468, Total Rows Rejected: 0
[2014-02-27 11:50:05] Status: Running, Run Id: 61194 – Total Rows Processed: 22071204, Total Rows Rejected: 0
[2014-02-27 11:50:05] Status: Running, Run Id: 61194 – Total Rows Processed: 22162032, Total Rows Rejected: 0
[2014-02-27 11:50:05] Status: Aborted, Run Id: 61194 – Error Code: 110802 – Message: An internal DMS error occurred that caused this operation to fail
. Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Workers.DmsSqlNativeException, Message: SqlNativeBufferBufferBulkCopy.WriteToServ
er, error in OdbcWriteBuffer: SqlState: , NativeError: 0, ‘Error calling: bcp_batch(pConn->GetHdbc()) | SQL Error Info: SrvrMsgState: 0, SrvrSeverity:
0, | Error calling: pBcpConn->WriteBuffer(pBuffer, bufferOffset, bufferLength, pRowsWritten) | state: FFFF, number: 1157368, active connections: 8′,
Connection String: Driver={SQL Server Native Client 10.0};APP=DmsNativeWriter:PDWT1-CMP02\sqldwdms (5260) – ODBC;Trusted_Connection=yes;AutoTranslate
=no;Server=PDWT1-SQLCMP02.PDWT1.dwpu.local,1502
[2014-02-27 11:50:05] Multiple Transaction support is ON. There might be inconsistencies with the data.
[2014-02-27 11:50:06] Load has Failed

To me the actual problem was not really visible, and it took quite a bit of effort to figure out where the issue was. After analyzing several log files, though, I found something really interesting.

Error: 1101, Severity: 17, State: 12.
Could not allocate a new page for database 'DB_fc977c0ed10d43ffab762d9a17230c21' because of insufficient disk space in filegroup 'DIST_C'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

After changing the size and setting autogrow to on, the problem was solved. Unfortunately the error message reported by DWLoader is not really useful, and a lot of digging had to be done. I hope to see this changed at some point.

by Stephan Köppen

ETL or ELT… or both??

March 2, 2014, 6:12 am

PDW 2012 | SQL Server 2005-2014

With database servers getting more and more powerful, some traditional concepts of Business Intelligence solutions may be reconsidered. One of those concepts, and the subject of a lot of lively debate recently, is the question of whether to use ETL or ELT.

Here are just a few of the blog posts and discussions you can find on this topic:

In short, the main differences are shown in the table below:

	ETL	ELT
Characteristics	ETL = Extract-Transform-Load. Transformation is done in the ETL tool (data flow pipeline); only the finally prepared data is loaded into the data warehouse database.	ELT = Extract-Load-Transform. Original data is loaded into the database, then SQL is used to transform the data into the dimensional model.
Pros	Well established ETL tools are available with debugging, built-in logging, configuration, error handling, process reporting and statistics. A comprehensive and easy to maintain data flow makes it easy to merge data from different sources, to use data quality tools and to deal with error rows individually.	ELT fully benefits from the database power, query optimizer and so on. Especially for MPP environments (like the PDW): scaling the database means scaling up the ELT process performance as well. SQL code is easier to maintain in source control systems (like TFS) than ETL packages (complex XML).
Cons	ETL pipeline tools support multiple cores, but parallel IO has to be solved programmatically: you have to do something (for example use the Balanced Data Distributor SSIS component or merge multiple inputs from the same source table). ETL tools are built for row-based operations. Operations that need to be performed on a set of rows (like sort and aggregate or calculations covering multiple rows) are harder to solve. I wrote some posts recently about ELT calculations that are relatively difficult to solve in ETL.	SQL is harder to read, to structure and to document compared to ETL packages. You need discipline, as minor mistakes may lead to errors that are hard to track down (e.g. too many resulting rows from a join operation if a key is missing in the join).

This comparison is far from complete, and if you read the links above (and many others that target this topic) you can find a lot more pros, cons and opinions. In fact, I don’t want to say one is better than the other. But here is what we recently found to work well in a large project using the Parallel Data Warehouse (PDW) for an initially 30 TB (and growing) database. The following illustration, which I recently used at the SQL Conference 2014 in Germany, shows the key concepts:

We’re using ETL (Microsoft SQL Server Integration Services) to

Orchestrate the load process
- workflow management (make sure the right things happen in the right order)
- dealing with technical challenges (e.g. temporary tables, partition switching on PDW)
- implement configuration management (for example server and database names)
- logging and process monitoring (reports)
Load dimensions (small amount of data)
- Collecting master data from source systems, merge and prepare this data
- Generation of surrogate keys for the dimensions
- Keeping track of historical changes (modeled as SCD2 or intermediate bridge tables)
- Building up all dimensions and transferring the data to the PDW (using a reload operation)
- Early arriving facts (create missing dimension rows, distinct counts run on PDW)

Why?

Integration Services (SSIS) is well suited for these tasks
SMP SQL Server offers good support for dimension processing tasks (identity column, T-SQL merge statement etc.)
Additional services like SQL Server Data Quality Services (DQS) and SQL Server Master Data Services (MDS) are currently not supported on the PDW
This is also true for more sophisticated tasks and the use of web services, for example to find duplicate customers, to correct misspelled street names or to guess the gender from the first name. Also, if you need to use custom assemblies, for example to access special source systems or include specific calculations, ETL tools are the better choice.

Then, we’re using ELT (distributed SQL, DSQL) on the Parallel Data Warehouse to

process fact data (large amount of data) after it is bulk loaded with no modifications into a staging database on the PDW
- Data preparation (for example removing duplicate rows)
- Linking fact table data to dimensions
- Performing calculations (using SQL window functions intensively)
Merge new data to archive
- store the data in the persisted stage area (without creating duplicates if the data was there already)

Why?

Much better performance observed compared to SMP SQL Server/SSIS
- in our case, usually about 10-20 times faster, depending on the source data and the transformations
- In some cases (for example removing duplicate rows in the source data) even 100 times faster
Faster loads allow us to fully reload many TB in case this is needed (this gives more options for backup strategies and for the dimensional modeling)
Solution will directly benefit from future MPP scale up without any need of coding

CONCLUSION

ETL and ELT may work well together. In this scenario we did the dimension processing as well as the full workflow management using ETL tools (SSIS on an SMP SQL Server) and the processing of the large transactional tables using ELT (distributed SQL on PDW).

by Hilmar Buchta

Improved Waterfall Chart with SSRS

March 12, 2014, 5:34 am

Some time ago, I received the task of developing a Waterfall Chart with Microsoft’s Reporting Services. The requirements were very specific and demanded much more than a simple SSRS standard chart. In the end the result made everyone happy, which is why I want to introduce my approach to you.

Establish the basis

For my example we need a dataset, let’s call it dsSales, as our data source. For this purpose we use the following query.

SELECT ProductCategoryName, SalesAmount
FROM
( 
    VALUES ('Components' , 577.13)
         , ('Accessories', 103.77)
         , ('Bikes'      , 865.08)
         , ('Clothing'   , 118.84)
         , ('Other Vehicles'   , -292.16)
) Sales(ProductCategoryName, SalesAmount);

The result of the query looks like this.

We use a Range Column Chart as the basis, which is linked to the dataset dsSales. The column SalesAmount is used initially for our data series. As Category Group, we select the column ProductCategoryName. For a cleaner illustration, the chart title, axis titles and legend can be hidden.

The chart should now look similar to this.

The Waterfall Chart

Let’s take a look at our SalesAmount data series.

A Range Column Chart has two value parameters: a high value, currently assigned Sum(SalesAmount), and a low value, which is currently not assigned.

Now how do we get our Waterfall Chart? What we want to do is move each range upward or downward depending on its predecessors.

One possibility is to calculate the running total through expressions. For this we use the function RunningValue. I have outlined the approach for the purpose of a better understanding below.

As a result we’ll get these two expressions for the higher and the lower value.

High Value = RunningValue(Fields!SalesAmount.Value, Sum, Nothing)
Low Value = RunningValue(Fields!SalesAmount.Value, Sum, Nothing)
                - Sum(Fields!SalesAmount.Value)
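To make the two expressions more tangible, here is a minimal Python sketch (not SSRS code) that computes the same high and low values for the sample rows of dsSales. The function name waterfall_ranges is my own, and the rounding to two decimals is only there to keep the printed numbers readable.

```python
from itertools import accumulate

# Sample rows from the dsSales query: (ProductCategoryName, SalesAmount)
sales = [
    ("Components", 577.13),
    ("Accessories", 103.77),
    ("Bikes", 865.08),
    ("Clothing", 118.84),
    ("Other Vehicles", -292.16),
]

def waterfall_ranges(rows):
    """Return (category, low, high) per row, mirroring the two chart expressions."""
    amounts = [amount for _, amount in rows]
    running = list(accumulate(amounts))  # plays the role of RunningValue(..., Sum, Nothing)
    return [
        # low value = running total minus the row's own amount
        (name, round(high - amount, 2), round(high, 2))
        for (name, amount), high in zip(rows, running)
    ]

for name, low, high in waterfall_ranges(sales):
    print(f"{name:15} low={low:8.2f} high={high:8.2f}")
```

Note that for the negative amount the "low" value ends up above the "high" value, which is simply a consequence of subtracting a negative number from the running total.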

If we look at the preview, the report should now look similar to this.

Improving the Waterfall Chart

Now we will add some reference lines to improve the presentation of our Waterfall Chart. These lines are used to illustrate the dependency to the previous product category and allow the viewer a better visual understanding.

As preparation, the horizontal gridlines are hidden. The width of the SalesAmount data series can be further reduced by changing the CustomAttribute Property, PointWidth to 0.6.

Now we come to the said reference lines. For this purpose we add another data series, again choosing SalesAmount as our source column. SSRS automatically creates another Range Column Chart and names it SalesAmount1. Via the context menu, we can change the chart type to a Stepped Line Chart. By dragging the new Stepped Line Chart upwards until it comes first, we ensure that it is drawn behind the SalesAmount Range Column Chart.

The Y value must be changed for this data series as well. All we need to do is use the same expression as for the high value of our Range Column Chart.

Y Value = RunningValue(Fields!SalesAmount.Value, Sum, Nothing)

Finally, we’ll adjust some properties of both charts for presentation purposes. Color and BorderColor are set to Silver. BorderStyle is set to Solid for the Range Column Chart and to Dashed for our Stepped Line Chart.

The Border around the Range Column Chart is needed because otherwise, we will see an unwanted artifact like a small step where the line disappears behind the range.

The result looks like this.

Adding a total sum

Now that we have our Waterfall Chart we can make further modifications. What about a total sum? The easiest way is to add another separate data series for it.

First we modify the query from the dataset dsSales.

WITH Sales
AS
(
    SELECT ProductCategoryName, SalesAmount
    FROM
    ( 
        VALUES ('Components' , 577.13)
             , ('Accessories', 103.77)
             , ('Bikes'      , 865.08)
             , ('Clothing'   , 118.84)
             , ('Other Vehicles'   , -292.16)
    ) Sales(ProductCategoryName, SalesAmount)
)
SELECT *
FROM
(
        SELECT 1, ProductCategoryName, SalesAmount, NULL
        FROM Sales
    UNION ALL
        SELECT 2, 'Sum total', NULL, SUM(SalesAmount)
        FROM Sales
) Result(CategorySortId, CategoryLabel, SalesAmount, TotalSalesAmount);

The result of the query looks like this.

What has changed?

TotalSalesAmount contains the desired total in a separate line.
CategorySortId is a new column that is used for sorting, because the total sum should be the last one on the chart.
ProductCategoryName was renamed to CategoryLabel because it no longer exclusively contains identifiers for the product categories.

Because of the changes we have made on the dataset the chart needs to be updated too. The Category Group ProductCategoryName must be replaced by CategoryLabel.

The sorting is adjusted within the Category Group Properties to ensure the right order. First, it is sorted by CategorySortId, followed by CategoryLabel.
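As a rough illustration of what the modified query and the sorting produce, the following Python sketch builds the same six rows from the sample values of the first query and sorts them by CategorySortId and then CategoryLabel. The function name with_total is hypothetical, and the rounding of the total is only added to avoid floating point noise.

```python
# Sample rows: (ProductCategoryName, SalesAmount), values as in the first query
sales = [
    ("Components", 577.13),
    ("Accessories", 103.77),
    ("Bikes", 865.08),
    ("Clothing", 118.84),
    ("Other Vehicles", -292.16),
]

def with_total(rows):
    """Return rows shaped like (CategorySortId, CategoryLabel, SalesAmount, TotalSalesAmount)."""
    # first branch of the UNION ALL: one row per category, no total
    detail = [(1, name, amount, None) for name, amount in rows]
    # second branch: a single 'Sum total' row that only carries the total
    total = (2, "Sum total", None, round(sum(amount for _, amount in rows), 2))
    # sort by CategorySortId first, then by CategoryLabel
    return sorted(detail + [total], key=lambda row: (row[0], row[1]))

for row in with_total(sales):
    print(row)
```

Because the sort id of the total row is higher than that of all category rows, it always ends up last, regardless of its label.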

What is missing is the desired total sum. For this purpose, a new data series is added by selecting TotalSalesAmount. Setting the custom attribute DrawSideBySide to false prevents data points with the same x value from being drawn side by side, as the name suggests. The width is again reduced to 0.6 through PointWidth.

The final result looks like this.

Summary

What I have demonstrated is how you can build a chart type that is not offered as a template from existing standard chart types and a few expressions. Because it combines different chart types, the result is also highly customizable and extensible: data labels can be added, the background color can be adjusted depending on the value, and unusual or special data points can be highlighted. Whatever is necessary to increase the meaningfulness of the observed information.

I hope this foretaste stimulates your imagination.

by Jan Köhler

Converting events to hourly based aggregations

March 16, 2014, 4:09 am

PDW 2012 | SQL Server 2012

For today’s post I didn’t find a really good title. Here’s what this post is about: Sometimes you’ll find event based data in your source system (something happens at a specific point in time) but for the data warehouse you want to transform this data to match a given time dimension. The situation is similar to an older post I wrote about SQL Server window functions.

There are some approaches to accomplish this task. For today’s post I’d like to show a SQL-based approach we’re currently using in an ELT process on a Parallel Data Warehouse (PDW) 2012.

Let’s assume you’re working for a retailer who is interested in the number of cash desks being open at a time. A subset of the event data (single day, single store) may look like this:

In this case cash desk 2 opens at 06:35, then desk 1 opens at 08:27, then desk 2 closes at 11:58 and so on. The question is: how many cash desks are open, for example, from 08:00 to 09:00? If a cash desk is only open for half an hour in the given time range, it should be counted as 0.5, so between 08:00 and 09:00 approximately 1.5 cash desks were open (desk 2 for the full hour and desk 1 for about half an hour).

In order to get the number of open cash desks, we first convert the transaction type to a delta: +1 means a desk opens, -1 means a desk closes. Here is the query together with the result:

select
TransactionDate
, StoreID
, TransactionTime
, CashdeskID
, case TransactionType
when 'signon' then 1
when 'signoff' then -1
else 0
end CashdeskDelta
from (
select TransactionDate, StoreID, TransactionTime, CashdeskID,TransactionType
from POSData where TransactionType in ('signon','signoff')) P
order by 1,2,3

The result (again a subset) may look like this:

After transforming the transaction type to a numeric value, we can aggregate it using a window function. Therefore I’m using the query from above as a sub query:

select *,
sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime]) OpenCashdesks,
datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by TransactionDate, StoreID order by [TransactionTime])) TimeDelta
from
(
-- query from above --
) CheckoutDetails
order by 1,2,3

Again, this shows the power of the window functions. The query gives us the number of open cash desks together with the number of seconds to the next event.
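For readers less familiar with window functions, here is a small Python sketch of what the running sum and the lead function compute. It uses the three events mentioned in the text; the seconds are assumed to be zero, so the time deltas differ slightly from the real data, and the names running_open_desks and hms are my own. It also assumes distinct timestamps within one store and day.

```python
def hms(hours, minutes, seconds=0):
    """Convert a time of day to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds

def running_open_desks(events):
    """events: list of (seconds_since_midnight, delta), delta = +1 (signon) or -1 (signoff).

    Returns rows of (time, delta, open_desks, seconds_to_next_event)."""
    ordered = sorted(events)
    result = []
    open_desks = 0
    for index, (moment, delta) in enumerate(ordered):
        # running total, like sum(CashdeskDelta) over (... order by TransactionTime)
        open_desks += delta
        # next event time, like lead(TransactionTime, 1); None for the last row
        next_moment = ordered[index + 1][0] if index + 1 < len(ordered) else None
        time_delta = None if next_moment is None else next_moment - moment
        result.append((moment, delta, open_desks, time_delta))
    return result

# Events from the example: desk 2 opens, desk 1 opens, desk 2 closes
events = [(hms(6, 35), 1), (hms(8, 27), 1), (hms(11, 58), -1)]
for row in running_open_desks(events):
    print(row)
```

The last row has no following event, which is why its time delta is empty, just as lead returns null for the last row of a partition.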

For example, from 8:27 to 11:58, 12622 seconds passed. During this time, 2 cash desks were open. This is a huge step towards the solution but we still have no hour based time frame in the data. However, this can easily be created by cross joining the dimensions for store and time. For my example, I have no store or time dimension (as you should usually have), so I’m using the table sys.all_objects here to generate a sufficient number of data rows:

with
Timeslot AS (
select T2.TransactionDate, T2.StoreID, T1.TimeGrid from
(select top 24 convert(time,dateadd(hour,row_number () over (order by [name])-1,'00:00:00')) TimeGrid from sys.all_objects) T1
cross join
(select distinct TransactionDate, StoreID from POSData) T2
)

The query creates one row per hour for each store and each date. Again, usually you would use your existing dimension tables instead of the sys.all_objects table here.

Now, let’s bring both queries together:

with

Timeslot AS
(
-- timeslot query from above
),

CashDeskTimesDetails as
(
select *,
sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])
CashdesksOpen
,sum(CashdeskDelta) over (partition by TransactionDate, StoreID order by [TransactionTime])*
TimeDeltaSeconds CashdeskOpenSeconds
,convert(time, dateadd(hour, datediff(hour, 0, TransactionTime),0)) TransactionHour
from
(
select
TransactionDate
, StoreID
, TransactionTime
, coalesce(
datediff(s,[TransactionTime],lead([TransactionTime],1) over (partition by
TransactionDate, StoreID order by [TransactionTime]))
,
datediff(s,[TransactionTime],dateadd(day,1,0)) -- fill seconds to end of day
) TimeDeltaSeconds
, CashdeskID
, case TransactionType
when 'signon' then 1
when 'signoff' then -1
else 0
end CashdeskDelta
from (
select TransactionDate, StoreID, TransactionTime, CashdeskID,TransactionType from
POSData where TransactionType in ('signon','signoff')
union all
select TransactionDate, StoreID, TimeGrid, 0, 'timeslot' from Timeslot
) P
) CheckoutDetails
)
select * from CashDeskTimesDetails
order by 1,2,3

The result shows the original data together with the fixed time frame (24 hours).

Some things to pay special attention to:

The inserted timeslots are created with a cash desk delta of 0, so they do not modify the number of open desks (column CashdesksOpen).
The last time slot has no subsequent time slot, so the lead window function returns null. The coalesce expression overrides this with the number of seconds until the end of the day.
The column TransactionHour adds the base hour to each row. It is used for the group-by operation in the following step.

Finally, we simply need to aggregate the last query result:

select
TransactionDate,TransactionHour,StoreID,
Sum(convert(float,CashdeskOpenSeconds)) / Sum(convert(float,TimeDeltaSeconds)) CashdeskCount
from CashDeskTimesDetails
where TimeDeltaSeconds<>0
group by TransactionDate,TransactionHour,StoreID
order by 1,2,3
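The whole pipeline (hourly grid, running sum, time to next event, time-weighted average per hour) can be sketched in a few lines of Python for a single store and day. This is only an illustration of the logic, not the ELT code itself. The function name hourly_open_desks is hypothetical, and the sample events again assume zero seconds.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

def hourly_open_desks(events):
    """events: list of (seconds_since_midnight, delta) for one store and one day.

    Returns {hour: time-weighted average number of open cash desks}."""
    # one zero-delta row per hour, like the rows added from the Timeslot CTE
    grid = [(hour * SECONDS_PER_HOUR, 0) for hour in range(24)]
    timeline = sorted(list(events) + grid)
    open_seconds = defaultdict(int)
    slot_seconds = defaultdict(int)
    open_desks = 0
    for index, (start, delta) in enumerate(timeline):
        open_desks += delta  # running sum of CashdeskDelta
        # seconds to the next row; the last row is filled up to the end of the day
        end = timeline[index + 1][0] if index + 1 < len(timeline) else SECONDS_PER_DAY
        duration = end - start
        if duration == 0:
            continue  # same effect as "where TimeDeltaSeconds<>0"
        hour = start // SECONDS_PER_HOUR  # base hour, like TransactionHour
        open_seconds[hour] += open_desks * duration
        slot_seconds[hour] += duration
    return {hour: open_seconds[hour] / slot_seconds[hour] for hour in sorted(slot_seconds)}

events = [(6 * 3600 + 35 * 60, 1), (8 * 3600 + 27 * 60, 1), (11 * 3600 + 58 * 60, -1)]
result = hourly_open_desks(events)
print(result[8])
```

With these assumed times, desk 1 is open for 33 of the 60 minutes between 08:00 and 09:00, so the value for hour 8 comes out slightly above the rough 1.5 mentioned earlier.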

Here is the final result for the sample data subset from above:

Removing all filters (all dates, all stores) may result in a graph like this:

So this post showed how to transform event based data to a fixed time scale (hours in this case) to match a star schema join to the time dimension. Since we only used SQL this process can be easily used in an ELT loading scenario.

by Hilmar Buchta

Discover missing rows of data

April 7, 2014, 3:30 am

PDW 2012 | SQL Server 2012 | SQL Server 2014

If your source data contains a sequential number without gaps, it’s relatively easy to find out whether data rows are missing. The approach I’m showing here uses window functions, which have been available since SQL Server 2012 and SQL Server Parallel Data Warehouse 2012.

In order to have some sample data for this post, I’m using the FactInternetSales table of the AdventureWorksDW2012 database. Let’s pretend the column SalesOrderNumber of that table should not have any gaps. I convert the column data to a numeric type and use only the rows having line item sequence number equal to 1 for my sample data.

SELECT
SalesOrderNumber,
convert(int, substring(SalesOrderNumber,3,255)) SalesOrderIntNumber
FROM [FactInternetSales]
WHERE [SalesOrderLineNumber]=1
ORDER BY SalesOrderNumber

Usually the order number is sequential, but we find some gaps here. For example, the order following order number SO43842 is SO43918, so there are 43918 - 43842 - 1 = 75 rows missing.

Using window functions and a sub query, we can add the next number as a column to the query together with the distance:

select *, NextSalesOrderIntNumber-SalesOrderIntNumber-1 MissingRows
from
(
select
SalesOrderIntNumber,
lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber
from
(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1
) TransactionData
) TransactionDataSequence
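The same gap calculation can be sketched in Python to show what lead and the subtraction do. Only SO43842 and SO43918 are taken from the text; the other two order numbers are made up for illustration, and the function names are my own.

```python
def to_int(sales_order_number):
    """Strip the two-letter prefix, like convert(int, substring(SalesOrderNumber,3,255))."""
    return int(sales_order_number[2:])

def missing_rows(order_numbers):
    """Return (current, next, missing) for each pair of neighbouring order numbers."""
    ordered = sorted(order_numbers)
    # zip pairs each number with its successor, which is what lead(..., 1) provides
    return [
        (current, following, following - current - 1)
        for current, following in zip(ordered, ordered[1:])
    ]

sample = ["SO43841", "SO43842", "SO43918", "SO43919"]  # partly hypothetical
gaps = missing_rows([to_int(number) for number in sample])
for row in gaps:
    print(row)
print("total missing:", sum(missing for _, _, missing in gaps))
```

One difference to the SQL version: the query keeps the last row with a null successor, while the sketch simply produces no pair for it. The sum of missing rows is the same in both cases.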

As you can see, the 75 missing rows are now reported correctly by the query. The only task left is to aggregate the number of missing rows by replacing the outer query like this:

select Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1) MissingRows
from
(
select
SalesOrderIntNumber,
lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber
from
(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
SalesOrderIntNumber FROM [FactInternetSales] where [SalesOrderLineNumber]=1
) TransactionData
) TransactionDataSequence

As a quality measure you could show the ratio of the missing rows to the total rows (or 100% minus this ratio as a data completeness measure) and – assuming that the missing rows had an average sales amount – also the estimated missing amount. And it’s also useful to get the result on more granular level, for example per month. Here is the full query:

select

orderdatekey/100 [Month],

Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1) MissingRows,

convert(float,Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1))/count(*)
MissingRowsRatio,

convert(float,Sum(NextSalesOrderIntNumber-SalesOrderIntNumber-1))/count(*)
* Sum([ExtendedAmount]) MissingRowsEstimatedValue

from
(
select
SalesOrderIntNumber,
lead(SalesOrderIntNumber,1) over (order by SalesOrderIntNumber) NextSalesOrderIntNumber,
[ExtendedAmount], OrderDateKey
from
(SELECT SalesOrderNumber, convert(int, substring(SalesOrderNumber,3,255))
SalesOrderIntNumber, [ExtendedAmount], OrderDateKey
FROM [FactInternetSales] where [SalesOrderLineNumber]=1
) TransactionData
) TransactionDataSequence

group by orderdatekey/100
order by orderdatekey/100

Plotting the result over time gives a good overview. For my example data, quality improved a lot since August 2007.

Conclusion: This is another example of how window functions provide an elegant solution for analytical data tasks. And since this works perfectly on a PDW, the approach works well even with billions of rows of data.

by Hilmar Buchta

Using Group By and Row Number in SQL statements

May 12, 2014, 3:11 pm

In this article I want to show some features about the Group By clause and the Row Number window function that you can use in SQL statements.

There are many situations where you want a unique list of items. But in the data source the items are not unique.

Let’s take an example from the AdventureWorks2012 database. If you want a list of job titles from the Employee table, then the job titles are not unique:

SELECT [JobTitle]
FROM [AdventureWorks2012].[HumanResources].[Employee]
WHERE [JobTitle] IN (N'Design Engineer', N'Research and Development Manager')

You can see that Design Engineer appears three times and Research and Development Manager appears twice in the data set.

In order to get a unique list of job titles you can use the group by clause:

SELECT [JobTitle]
FROM [AdventureWorks2012].[HumanResources].[Employee]
WHERE [JobTitle] IN (N'Design Engineer', N'Research and Development Manager')
GROUP BY [JobTitle]

Now the list of job titles is unique. You can enhance the SQL statement for the following questions:

1) How many Design Engineers and Research and Development Managers do we have?

2) What is the smallest birth date and the smallest hire date per job title?

For question 1 we use the count function and for question 2 the minimum function.

SELECT [JobTitle]
     , COUNT(*) AS count
     , MIN([BirthDate]) AS MinBirthDate
     , MIN([HireDate]) AS MinHireDate
FROM [AdventureWorks2012].[HumanResources].[Employee]
WHERE [JobTitle] IN (N'Design Engineer', N'Research and Development Manager')
GROUP BY [JobTitle]

It is not guaranteed that the smallest birth and hire date are from the same employee. The following statement shows the data from all Research and Development Managers:

SELECT [NationalIDNumber], [JobTitle], [BirthDate], [HireDate]
FROM [AdventureWorks2012].[HumanResources].[Employee]
WHERE [JobTitle] = N'Research and Development Manager'

In this case line 1 gives you the smallest HireDate and line 2 the smallest BirthDate. Thus the smallest BirthDate and the smallest HireDate belong to different employees.
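The effect is easy to reproduce outside the database. The following Python sketch groups a handful of made-up employee rows (the IDs and dates are hypothetical, not the AdventureWorks data) and computes the count and the two minimums per job title, just like the GROUP BY statement.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (NationalIDNumber, JobTitle, BirthDate, HireDate)
employees = [
    ("1001", "Research and Development Manager", date(1975, 3, 2), date(2008, 1, 15)),
    ("1002", "Research and Development Manager", date(1969, 7, 21), date(2009, 6, 1)),
    ("1003", "Design Engineer", date(1980, 5, 10), date(2010, 2, 1)),
]

def aggregate_by_title(rows):
    """Count rows and take MIN(BirthDate) and MIN(HireDate) per job title."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return {
        title: {
            "count": len(members),
            "min_birth": min(member[2] for member in members),  # computed independently ...
            "min_hire": min(member[3] for member in members),   # ... of each other
        }
        for title, members in groups.items()
    }

summary = aggregate_by_title(employees)
print(summary["Research and Development Manager"])
```

In this made-up data the smallest birth date comes from employee 1002 while the smallest hire date comes from employee 1001, because each aggregate is evaluated on its own column.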

You can use the row number window function in order to get the HireDate from the employee with the smallest BirthDate per JobTitle.

WITH BASIS AS (
    SELECT [NationalIDNumber]
         , [JobTitle]
         , [BirthDate]
         , [HireDate]
         , ROW_NUMBER() OVER (PARTITION BY [JobTitle] ORDER BY [BirthDate] ASC) AS rn
    FROM [AdventureWorks2012].[HumanResources].[Employee]
    WHERE [JobTitle] IN (N'Design Engineer', N'Research and Development Manager')
)
SELECT [NationalIDNumber]
     , [JobTitle]
     , [BirthDate]
     , [HireDate]
FROM BASIS
WHERE rn = 1

The row number function numbers the rows starting at one for each JobTitle, which is the column listed in the partition by section of the function. The numbering is ordered by BirthDate, which is listed in the order by section. Row number 1 therefore marks the row with the smallest BirthDate per JobTitle. You use a common table expression because you cannot filter directly on the result of the row number function. Below you find the result of the row number function:

SELECT [NationalIDNumber]
     , [JobTitle]
     , [BirthDate]
     , [HireDate]
     , ROW_NUMBER() OVER (PARTITION BY [JobTitle] ORDER BY [BirthDate] ASC) AS rn
FROM [AdventureWorks2012].[HumanResources].[Employee]
WHERE [JobTitle] IN (N'Design Engineer', N'Research and Development Manager')
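To illustrate how the numbering and the filter on rn = 1 work together, here is a minimal Python sketch. The employee rows are hypothetical (they are not the AdventureWorks data) and the function name with_row_number is my own.

```python
from datetime import date
from itertools import groupby

# Hypothetical rows: (NationalIDNumber, JobTitle, BirthDate, HireDate)
employees = [
    ("1001", "Research and Development Manager", date(1975, 3, 2), date(2008, 1, 15)),
    ("1002", "Research and Development Manager", date(1969, 7, 21), date(2009, 6, 1)),
    ("1003", "Design Engineer", date(1980, 5, 10), date(2010, 2, 1)),
    ("1004", "Design Engineer", date(1985, 11, 30), date(2009, 9, 1)),
]

def with_row_number(rows):
    """Append rn, numbered per JobTitle in BirthDate order, to every row."""
    # sorting by (JobTitle, BirthDate) covers both PARTITION BY and ORDER BY
    ordered = sorted(rows, key=lambda row: (row[1], row[2]))
    numbered = []
    for _, members in groupby(ordered, key=lambda row: row[1]):
        for rn, row in enumerate(members, start=1):  # numbering restarts per job title
            numbered.append(row + (rn,))
    return numbered

# the outer query of the common table expression: keep only rn = 1
oldest = [row for row in with_row_number(employees) if row[-1] == 1]
for row in oldest:
    print(row)
```

Each remaining row carries the HireDate of the very employee who has the smallest BirthDate in that job title, which is exactly what the two independent MIN aggregates could not guarantee.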

by André Kienitz

Practical table partitioning on the Parallel Data Warehouse

June 8, 2014, 8:54 am

APS/PDW 2012

This post is about table partitioning on the Parallel Data Warehouse (PDW). The topic itself is actually quite simple, but there are some differences between the SMP SQL Server and the Parallel Data Warehouse.

On the SMP SQL Server table partitioning was important with large tables for two reasons:

Query performance
Workload management

For the SMP SQL Server, table partitioning allows queries and other table operations (like index rebuild) to be performed on multiple cores. Therefore table partitioning was done to improve query performance. However, the PDW architecture already stores larger tables (so called distributed tables) on each compute node by distributing it to multiple tables (so called distributions) on separate files in an optimal way for the available cores (currently 8 distributions per compute node). Therefore, when working on a PDW query performance usually isn’t the main reason for us to use table partitioning. But the second reason, workload management, still applies on the PDW. For example, when loading data it’s often useful to first load into a stage table, merge new and old data into a new partition and then switch that partition to the final table. So partitioning is still important on the PDW.

Creating a partitioned table on the PDW is a little bit easier compared to the SMP SQL Server as you don’t need (and don’t see) the partition scheme or partition function. The following statement is an example of creating a partitioned table:

CREATE TABLE [dbo].[MyTable1] (
id int
)
WITH (DISTRIBUTION = replicate, PARTITION (id range right FOR VALUES (10,20,30,40)));

In this query and throughout the remaining blog post, I’m only referring to a range right partition function. This is my preferred option as I think it’s more intuitive, although both ways are almost identical and all partitioning is usually handled by automated tasks, so the choice isn’t really important. Range right means that the partition boundary is in the same partition as the data to the right of the boundary (excluding the next boundary). So for a range right partition function, the left boundary is included while the right boundary is not. For example, a partition with boundaries 10 and 20 contains data with values greater than or equal to 10 and less than 20 (for integer values: 10, 11, 12, … 18, 19).

By specifying four boundaries in the create table statement from above, we have actually created five partitions as shown in the following table:

Partition Number	Range From	Range To	Formula for id
1		10	id < 10
2	10	20	10 ≤ id < 20
3	20	30	20 ≤ id < 30
4	30	40	30 ≤ id < 40
5	40		40 ≤ id
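The range right rule from the table can be expressed as a small lookup. The following Python sketch is only an illustration of the rule (the function name partition_number is hypothetical); it uses the standard library's bisect_right, which counts how many boundaries are less than or equal to the value.

```python
from bisect import bisect_right

BOUNDARIES = [10, 20, 30, 40]  # the boundary values from the create table statement

def partition_number(value, boundaries=BOUNDARIES):
    """Return the 1-based partition number for a range right partition function.

    A boundary value itself belongs to the partition on its right, so every
    boundary that is <= value moves the value one partition further."""
    return bisect_right(boundaries, value) + 1

for value in (9, 10, 19, 20, 39, 40, 1000):
    print(value, "-> partition", partition_number(value))
```

Four boundaries give five possible results, matching the five partitions in the table.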

On the PDW, the partition number is important for switch operations as well as for index rebuild operations. For example, in order to perform an index rebuild on partition 3 you run this code:

ALTER INDEX «index name | ALL» ON MyTable1 REBUILD PARTITION = 3

The product documentation (aps.chm) contains a query to return “…the partition numbers, boundary values, boundary value types, and rows per boundary for a partitioned table”:

SELECT sp.partition_number, prv.value AS boundary_value,
lower(sty.name) AS boundary_value_type, sp.rows
FROM sys.tables AS st
JOIN sys.indexes AS si
    ON st.object_id = si.object_id AND si.index_id <2
JOIN sys.partitions AS sp
    ON sp.object_id = st.object_id AND sp.index_id = si.index_id
JOIN sys.partition_schemes AS ps
    ON ps.data_space_id = si.data_space_id
JOIN sys.partition_range_values AS prv
    ON prv.function_id = ps.function_id
JOIN sys.partition_parameters AS pp
    ON pp.function_id = ps.function_id
JOIN sys.types AS sty
    ON sty.user_type_id = pp.user_type_id
        AND prv.boundary_id = sp.partition_number
WHERE st.object_id =
    (SELECT object_id
     FROM sys.objects
     WHERE name = 'MyTable1')
ORDER BY sp.partition_number

Let’s try the query with our table from above. Here is the output:

Some observations may be confusing here. The first thing is that each partition is reported to contain 200 rows although we have just created the table and therefore expect the table to be empty. However, the reported rows are taken from the sys.partitions system view. In the documentation for the sys.partitions view you find the following remark about the number of rows:

Approximate average number of rows in each table partition. To calculate this value, SQL Server PDW divides the number of rows in the table by the number of partitions in the table.

SQL Server PDW uses statistics, which might be out-of-date, to determine the total number of rows. The statistics are from the most recent run of UPDATE STATISTICS on the table. If UPDATE STATISTICS has not been run on the table, the statistics won’t exist, and SQL Server PDW will use 1000 as the default total number of rows. To display the number of rows in each partition within each distribution, use DBCC PDW_SHOWPARTITIONSTATS (SQL Server PDW).

So the number of rows is just estimated here, and since we haven’t created statistics for the table, PDW assumes the table contains 1000 rows. But wait, 1000 rows divided by 4 partitions gives 250, not 200, right? Well, remember that we actually have 5 partitions (1000 / 5 = 200), although the metadata query from above only lists 4. I’ll get back to this soon.

Statistics are easy to create, so let’s do this first:

create statistics stat_id on MyTable1(id)

Here is the result when running the meta data query again:

So, now the number of rows seems to be correct. But be careful: this is still only an approximation, and you cannot expect it to be accurate.

The other puzzling thing about the output of the metadata query is that it only reports 4 partitions, although we figured out earlier that there should be 5. The boundary value may be confusing as well. For partition number one, we found out that it contains all data rows with id less than 10 (not equal). So the boundary value from the output is the excluded right boundary of our range right partitioning, which is confusing.

Stephan Köppen already posted some useful queries for the PDW (see his post here). Using his partition query gives a much better result. I made some minor changes to the query for this blog post; here’s the version I’m using:

create table #Partitions
with (LOCATION = USER_DB, distribution=replicate)
as
SELECT
p.partition_number PartitionNr
, cast(coalesce(lag(r.value,1) over (order by p.partition_number),-2147483648) as int) RangeFromIncluding
, cast(coalesce(r.value,2147483647) as int) AS [RangeToExcluding]
FROM sys.tables AS t
JOIN sys.indexes AS i ON t.object_id = i.object_id
JOIN sys.partitions AS p ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.partition_schemes AS s ON i.data_space_id = s.data_space_id
JOIN sys.partition_functions AS f ON s.function_id = f.function_id
LEFT JOIN sys.partition_range_values AS r ON f.function_id = r.function_id and r.boundary_id = p.partition_number
WHERE i.type <= 1
and t.name='MyTable1'

create table #PartitionData
with (LOCATION = USER_DB, distribution=replicate)
as
select id, count(*) [rows] from MyTable1 group by id

-- show partitions and number of rows
select PS.PartitionNr, PS.RangeFromIncluding, PS.RangeToExcluding, coalesce(Sum([rows]),0) [rows]
from #Partitions PS left join #PartitionData GT on PS.RangeFromIncluding<= GT.id and PS.RangeToExcluding>GT.id
group by PS.PartitionNr, PS.RangeFromIncluding, PS.RangeToExcluding

drop table #Partitions
drop table #PartitionData

If you’re only interested in the partitions, the blue part of the query (the first statement, which creates #Partitions) is enough. The query uses the lag window function to retrieve the lower boundary. The remaining query is used to obtain the exact number of rows for each partition. Please note that the boundary information resulting from my modifications is only valid for a range right partition function. Here is the result:

As you can see, this corresponds exactly to the five partitions from the first table above. The upper boundary of partition 5 should be increased by one to be 100% correct, but this would conflict with the maximum integer value. If you like, just return null for the lower boundary of partition 1 and the upper boundary of partition 5 and take this into account in the comparison with the existing data.
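To make the boundary logic concrete, here is a small Python sketch (not PDW code) that derives the same from/to ranges for a range right partition function. The function name `range_right_partitions` is made up for this illustration, and `None` stands in for the open lower and upper end, as suggested above.

```python
from typing import List, Optional, Tuple

Partition = Tuple[int, Optional[int], Optional[int]]


def range_right_partitions(boundaries: List[int]) -> List[Partition]:
    """Return (partition_number, from_including, to_excluding) tuples.

    With RANGE RIGHT each boundary value belongs to the partition on its
    right, so n boundaries always produce n + 1 partitions.
    None marks an open end (minus or plus infinity).
    """
    ordered = sorted(boundaries)
    lower = [None] + ordered   # same idea as lag(r.value, 1) in the query
    upper = ordered + [None]
    return [(nr, lo, hi) for nr, (lo, hi) in enumerate(zip(lower, upper), start=1)]


for partition in range_right_partitions([10, 20, 30, 40]):
    print(partition)
```

Four boundary values yield five partitions, which is why a query that joins on the boundary values alone lists one partition too few.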

Also, when reading the above query part that is printed in black, you should adapt the method for counting the rows per partition to your needs. The method I’m using here proved to work fine for discrete values (integer ids). Since we usually partition by an integer column (for example a date written as yyyymmdd, 20140501 for May 1, 2014), this approach works fine for most of our workloads.

Next thing of interest is the partition number. As I wrote earlier in this post, the partition number is used for partition switch operations or for example for index rebuilds. It’s important to understand that the partition number is always a consecutive range of numbers starting with the number one. Even if you merge two partitions into one, the number is still consecutive.

For example, let’s merge partitions 3 and 4. In the merge partition statement we only need to specify the boundary. In a certain sense, this boundary is removed to form the new partition. In our case, partitions 3 and 4 share the boundary value 30, so the merge statement looks like this:

ALTER TABLE MyTable1 MERGE RANGE (30);

Here is the result using the modified meta data table from above:

As you can see, the partition number is still consecutive and the partition ranging from 40 to infinity now has the number 4 instead of 5.

If you specify a boundary that doesn’t exist, you’ll get an error message:

ALTER TABLE MyTable1 MERGE RANGE (35);

A distributed query failed: Node(s):[201001-201002]The specified partition range value could not be found.

Splitting a partition works very similarly to a merge. Again, you can think of a split as inserting a new boundary. For example, let’s split at the value 35 (which is in partition 3):

ALTER TABLE MyTable1 SPLIT RANGE (35);

Here’s the result:

Again, the partition numbering is still consecutive and the former partition 4 now becomes partition 5 because we split partition 3.
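The "remove a boundary" and "insert a boundary" view of merge and split can be sketched in a few lines of Python. This is only a model of the behaviour described above, not how PDW implements it; the function names and the error text for an already existing split value are assumptions.

```python
from typing import List


def merge_range(boundaries: List[int], value: int) -> List[int]:
    """Remove an existing boundary; fail if it does not exist."""
    if value not in boundaries:
        raise ValueError("The specified partition range value could not be found.")
    return [b for b in boundaries if b != value]


def split_range(boundaries: List[int], value: int) -> List[int]:
    """Insert a new boundary and keep the list sorted."""
    if value in boundaries:
        raise ValueError("The boundary value already exists.")
    return sorted(boundaries + [value])


boundaries = [10, 20, 30, 40]
boundaries = merge_range(boundaries, 30)   # partitions 3 and 4 become one
boundaries = split_range(boundaries, 35)   # the new partition 3 is split at 35
print(boundaries)                          # [10, 20, 35, 40]
print(len(boundaries) + 1)                 # 5 partitions, numbered 1 to 5
```

Because partition numbers are just positions in the sorted boundary list, they stay consecutive after every merge or split.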

Now let’s validate our boundaries by inserting some lines of data:

insert into MyTable1 values(0)

As expected, the value 0 is written to partition 1 as –infinity ≤ 0 < 10.

truncate table MyTable1
insert into MyTable1 values(20)

The value 20 goes to partition 3 as 20 ≤ 20 < 35.

Now we’re going to insert 5 values which should fit the constraints for partition 4:

truncate table MyTable1
insert into MyTable1 values(35)
insert into MyTable1 values(36)
insert into MyTable1 values(37)
insert into MyTable1 values(38)
insert into MyTable1 values(39)

All of these values satisfy the constraint 35 ≤ x < 40 and therefore all the values are written to partition 4.
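The same checks can be reproduced outside the appliance. The following Python sketch uses `bisect_right` to find the partition number of a value under range right partitioning; `partition_number` is a hypothetical helper for illustration only.

```python
from bisect import bisect_right
from typing import List


def partition_number(boundaries: List[int], value: int) -> int:
    """Return the 1-based partition number for a value under RANGE RIGHT.

    bisect_right counts the boundaries that are <= value, so a value equal
    to a boundary lands in the partition to the right of that boundary.
    """
    return bisect_right(sorted(boundaries), value) + 1


boundaries = [10, 20, 35, 40]
for value in (0, 20, 35, 39, 40):
    print(value, "-> partition", partition_number(boundaries, value))
```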

Ok, these were just some examples to see how data is written to the different partitions of our table.

To complete this post, I’d finally like to show a partition switch. For this we need to create a table of the same structure:

same columns, same data types, same nullable settings (take care when creating computed columns in a CTAS statement)
same table geometry (heap, clustered index, clustered columnstore index)
same distribution method (both tables replicated or distributed by the same key)
same indexes and constraints
partitioned by the same column (but the partitions themselves may differ)

Generating the script for our table after the merge/split operation gives this result:

CREATE TABLE [dbo].[MyTable1] (
[id] int NULL
)
WITH (DISTRIBUTION = REPLICATE, PARTITION ([id] RANGE RIGHT FOR VALUES (10, 20, 35, 40)));

Now, replacing MyTable1 with MyTable2, we can create a table of exactly the same structure:

CREATE TABLE [dbo].[MyTable2] (
id int
)
WITH (DISTRIBUTION = replicate, PARTITION (id range right FOR VALUES (10, 20, 35, 40)));

We can now switch the 5 rows of data from above. Since they are all stored in partition 4 we can switch them using this statement:

alter table MyTable1 switch partition 4 to MyTable2 partition 4

This is where we finally needed the partition number. We can now check the rows in table MyTable2:

select * from MyTable2

As you can see, all 5 rows are moved (switched) over to table MyTable2.

A common scenario for loading data into the appliance is to first load new data into a stage table of the same layout as the final fact table. Then our meta data query from above helps, by running it against both tables, using the boundaries as the join conditions. This results in the source partition and matching destination partition together with the number of rows in each of them. For example, if your workload contains only new or updated data you can now load the data as follows:

If the source partition contains no rows at all, quit
If the destination partition is empty switch the source partition directly into the destination partition and quit
Otherwise blend/merge the source and destination partition data into a new table with corresponding boundary values (this requires three partitions), then move the existing data out of the destination partition and finally move the merged data into the destination partition.
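The three rules can be written down as a small decision function. The Python sketch below only models the decision per partition pair; the row counts are made-up sample values and the names `LoadAction` and `choose_load_action` are hypothetical.

```python
from enum import Enum


class LoadAction(Enum):
    SKIP = "nothing to load"
    SWITCH = "switch source partition into destination"
    MERGE = "merge into a new table, then switch out and in"


def choose_load_action(source_rows: int, destination_rows: int) -> LoadAction:
    """Pick the load step for one pair of matching partitions."""
    if source_rows == 0:
        return LoadAction.SKIP
    if destination_rows == 0:
        return LoadAction.SWITCH
    return LoadAction.MERGE


# Hypothetical row counts per partition: (stage table, fact table)
row_counts = {1: (0, 500), 2: (120, 0), 3: (80, 900)}
for nr, (src, dst) in sorted(row_counts.items()):
    print(nr, choose_load_action(src, dst).name)
```

In practice the row counts would come from running the partition query against both the stage table and the fact table and joining the results on the boundaries.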

Summary: Partitioning on the PDW is still useful for workload management but usually not to increase query performance. With the query presented here, it’s quite easy to find the partitions together with their boundaries and number of contained rows. This information can be used to decide about a good partition switching strategy.

by Hilmar Buchta

↧

Agile BI Tools – Database versioning with SSDT #1

June 10, 2014, 11:00 am

Microsoft’s SQL Server Data Tools (especially for databases) have become stronger and stronger. They can compare whole database catalogs, generate deployment scripts, deploy initial table data and much more. In the end you can fully maintain (develop, build, deploy, document and reverse engineer) a database.

I’ve been using SSDT for a while now, and sometimes I ask myself how to maintain the migration of a database developed in SSDT. Up to now we could avoid complex migrations, but it would be nice to have a process which handles such migrations more or less automatically. Before we dive into the different migration scenarios, I have to point out that a software version for a database is mandatory for maintaining complex migration scripts. Typically you write migration scripts to migrate a database from version x.y to x.y+1. This lets you handle complex scenarios, for example migrating a CDC enabled database or big fact tables.

The Software Version in SSDT projects

SSDT (to be precise DacFx, the backend of SSDT) already has a versioning mechanism implemented. Let’s take a look at the database properties:

As you can see the current database logical name and version is stored in the project file of the database. Because the project file is a simple MSBuild file we have many opportunities to extend the build behavior: for example auto incrementing a build number at the end of the version string or adding a debug or release suffix. One drawback is that this version number cannot be accessed by the script itself. It would be desirable to have a SQLCmd variable like $(DacVersion).

A marginal note: it seems that Microsoft’s preferred way of maintaining migration scripts is to implement a build or deployment contributor (see here for reference http://msdn.microsoft.com/en-us/library/dn268597(v=vs.103).aspx). That implies you have to implement some C# code which handles the migration for you. If you go this way, you have to address the normal bug fixing and deployment challenges to get it to work in your development environment.

Let’s get back to our “scripting” way of handling database migrations. We now examine an SSDT project file. As you can see, the DacVersion property holds our version information. This property is a normal MSBuild property.

To access this information in our scripts we need to add a SQLCMD variable. Let’s call it CurrentDacVersion.

In the next step we have to edit the project file. Search for the CurrentDacVersion SQLCmd variable. You should find something like this:

This is the definition of a SQLCMD variable. This variable is connected to the MSBuild variable $(SqlCmdVar__1). Changing this to $(DacVersion) solves the problem. Save the project file and reload it in Visual Studio. The SQLCMD tab of the project properties will now show this:

Now you can reference your current database version with the $(CurrentDacVersion) SQLCMD variable in your pre- or post-deployment scripts, with some limitations:

The local value overrides the default value. So the default value cannot be used anymore.
Changing the local value will not affect the version number in the project file. Changing the version has to be done in the properties window. So this SQLCMD variable is read-only for the developers.
You need to synchronize your publishing profiles so that the variable gets updated. However, it is best practice to review the publishing profiles occasionally.
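If you have to apply the project file edit described above to several database projects, it can be scripted. The following Python sketch rewrites the value of one SqlCmdVariable so that it points to $(DacVersion). The XML fragment is an assumed, simplified shape of an SSDT project file; check it against your own .sqlproj before relying on it. `bind_to_dac_version` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

MSBUILD_NS = "http://schemas.microsoft.com/developer/msbuild/2003"


def bind_to_dac_version(project_xml: str, variable: str = "CurrentDacVersion") -> str:
    """Point the <Value> of one SqlCmdVariable at the MSBuild property $(DacVersion)."""
    ET.register_namespace("", MSBUILD_NS)  # keep the default namespace on output
    root = ET.fromstring(project_xml)
    ns = {"m": MSBUILD_NS}
    for node in root.iterfind(".//m:SqlCmdVariable", ns):
        if node.get("Include") == variable:
            value = node.find("m:Value", ns)
            if value is None:
                value = ET.SubElement(node, f"{{{MSBUILD_NS}}}Value")
            value.text = "$(DacVersion)"
            return ET.tostring(root, encoding="unicode")
    raise LookupError(f"SqlCmdVariable {variable!r} not found")


# Assumed shape of the relevant project file fragment
sample = """<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <DacVersion>1.2.0.0</DacVersion>
  </PropertyGroup>
  <ItemGroup>
    <SqlCmdVariable Include="CurrentDacVersion">
      <DefaultValue>0.0.0.0</DefaultValue>
      <Value>$(SqlCmdVar__1)</Value>
    </SqlCmdVariable>
  </ItemGroup>
</Project>"""

print(bind_to_dac_version(sample))
```

The sketch works on a string, so reading and writing the actual file (and keeping a backup of it) is left to the caller.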

The next step is writing some SQL scripts which compare the target database version with the current database version and migrate some database objects (for example CDC enabled tables). I will cover this topic in my next post.

by Daniel Esser

↧

Agile BI Tools – Database versioning with SSDT #2

June 23, 2014, 6:28 am

In my previous post (Agile BI Tools – Database versioning with SSDT #1) I showed how we can reuse the already existing version number from the SSDT backend DAC. Today I will explain how to write a migration script based on database version numbering.

First of all I create two deployment scripts, one for post- and one for pre-deployment. As you can see, the Build Action is set to PostDeploy and PreDeploy respectively. This means SSDT’s build mechanism will handle these files in a special way. In very simple terms, the build prepends the pre-deployment script and appends the post-deployment script to the autogenerated deployment script at the end of the build.

What is Change Data Capture?

The Microsoft documentation describes Change Data Capture (CDC) as follows:

Change data capture records insert, update, and delete activity that is applied to a SQL Server table. This makes the details of the changes available in an easily consumed relational format. Column information and the metadata that is required to apply the changes to a target environment is captured for the modified rows and stored in change tables that mirror the column structure of the tracked source tables. Table-valued functions are provided to allow systematic access to the change data by consumers.

A good example of a data consumer that is targeted by this technology is an extraction, transformation, and loading (ETL) application. An ETL application incrementally loads change data from SQL Server source tables to a data warehouse or data mart. Although the representation of the source tables within the data warehouse must reflect changes in the source tables, an end-to-end technology that refreshes a replica of the source is not appropriate. Instead, you need a reliable stream of change data that is structured so that consumers can apply it to dissimilar target representations of the data. SQL Server change data capture provides this technology.

To know how to deal with CDC, we need to know a little bit more about how it works. The picture below shows the CDC data flow. As you can see, there is a capture process. This process captures any changes made to the underlying tables. These changes are written to so-called change tables. If a process incrementally loads change data from SQL Server source tables to a data warehouse or data mart, it can use the CDC query functions to get only the data that is new since the last run.

But Change Data Capture comes with some limitations. Let’s focus on the limitations which prevent smooth database migrations with SSDT’s built-in functionality. Changing the table layout of a CDC enabled table does not work out of the box. If you try, you will end up with a message like this:

Update cannot proceed due to validation errors. 
Please correct the following errors and try again.

SQL72035 :: [dbo].[MyCDCEnabledTable] is under change
data capture control and cannot be modified

The reason for this behavior is that the corresponding change table layout is based on the layout of the CDC enabled table. The capture process therefore needs a fixed table layout. To work around this limitation you have to disable CDC for this table. Doing so will delete the corresponding change table with its content, which means you will lose the current change state of the table.

How to deal with Change Data Capture during database migration?

In general, to deal with CDC you will need to follow this recipe:

1. Pre-Deployment
1.1 Copy the change table with a SELECT INTO statement to save the current CDC state.
1.2 Disable CDC on the CDC enabled table with the sys.sp_cdc_disable_table stored procedure.
2. Deploy the database with SSDT. This step will, for example, add or delete columns or change data types. Data must not be deleted in this step!
3. Post-Deployment
3.1 Enable CDC with the sys.sp_cdc_enable_table stored procedure. An empty change table will be generated in this step.
3.2 Copy the table backup to the empty change table with an INSERT INTO statement.
3.3 Update the start_lsn column in the cdc.change_tables table to the original value.
3.4 Drop the backup table.

OK, let’s go into a little more detail:

Step 1.1: Backup Change Table

Step 1.2: Disable CDC

Step 2.0: The original SSDT generated alter scripts.

Step 3.1: Enable CDC

Step 3.2: Restore Change Table

Step 3.3: Restore the start lsn

Step 3.4: Drop backup table
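As a rough illustration of how these steps fit together, the following Python sketch generates the T-SQL statements for one table in the order of the recipe. It is not the script from the original post. It assumes the default capture instance name (schema_table), assumes that the start LSN lives in the start_lsn column of cdc.change_tables, and uses two hypothetical backup tables, one of them only to carry the start LSN across the deployment batches.

```python
from typing import Dict, List


def cdc_migration_statements(schema: str, table: str) -> Dict[str, List[str]]:
    """Build the T-SQL for the pre- and post-deployment scripts of one table.

    Assumes the default capture instance name '<schema>_<table>' and
    hypothetical backup table names ending in '_backup'.
    """
    capture = f"{schema}_{table}"
    change_table = f"cdc.[{capture}_CT]"
    data_backup = f"[{schema}].[{capture}_CT_backup]"
    lsn_backup = f"[{schema}].[{capture}_lsn_backup]"

    pre = [
        # 1.1 save the change data and the start LSN of the capture instance
        f"SELECT * INTO {data_backup} FROM {change_table};",
        f"SELECT start_lsn INTO {lsn_backup} FROM cdc.change_tables "
        f"WHERE capture_instance = N'{capture}';",
        # 1.2 disable CDC, which drops the change table
        f"EXEC sys.sp_cdc_disable_table @source_schema = N'{schema}', "
        f"@source_name = N'{table}', @capture_instance = N'{capture}';",
    ]
    post = [
        # 3.1 enable CDC again, which creates an empty change table
        f"EXEC sys.sp_cdc_enable_table @source_schema = N'{schema}', "
        f"@source_name = N'{table}', @role_name = NULL;",
        # 3.2 copy the saved rows back (only works while the layout is unchanged)
        f"INSERT INTO {change_table} SELECT * FROM {data_backup};",
        # 3.3 restore the original start LSN
        f"UPDATE cdc.change_tables SET start_lsn = (SELECT start_lsn FROM {lsn_backup}) "
        f"WHERE capture_instance = N'{capture}';",
        # 3.4 drop the backup tables
        f"DROP TABLE {data_backup};",
        f"DROP TABLE {lsn_backup};",
    ]
    return {"pre": pre, "post": post}


script = cdc_migration_statements("dbo", "MyCDCEnabledTable")
print("\n".join(script["pre"] + ["-- SSDT generated changes run here"] + script["post"]))
```

The generated statements are meant as a starting point for the pre- and post-deployment scripts and should be reviewed and tested on a copy of the database first.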

Further Opportunities

As you will certainly have noticed, there is still a limitation in the above recipe. In step 3.2 we restore the data to the change table with an INSERT INTO … SELECT * statement.

If you remove columns, you will get an error because a column which exists in the source table (the backup) is missing in the new change table. Changing data types will force you to define explicit type casts. If you add columns with NOT NULL constraints, you will end up with the problem of defining default values. For the moment you have to modify step 3.2 for your needs.

Wrap Up Database Versioning

In the pre-deployment script we add some code to handle the target and the current database version:

We can now reference the target database version in the pre- and post-deployment scripts with the TSQL variable @TargetDacVersion and the current database version with the SQLCMD variable $(DacVersion).

In the BEGIN … END section we now can reuse the above CDC recipe.

Conclusion

I showed that you can use SSDT to handle some special kinds of database migration, for example CDC. It would be nice to have a more “intelligent” mechanism to compare the database versions, but then you would have to deal with parsing the version number or put the major, minor and hot fix portions into separate variables. Some of the functionality could be hidden in stored procedures to leave the deployment script in a clean state.

As I mentioned in my previous post, it seems that Microsoft’s preferred way of maintaining migration scripts is to implement a build or deployment contributor (see here for reference http://msdn.microsoft.com/en-us/library/dn268597(v=vs.103).aspx). I will try to cover this in my next post.
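To illustrate what parsing the version number buys you, here is a minimal Python sketch. It is not part of the SSDT scripts from this post; the function names and the list of available migrations are invented for the example. "Target" means the version found in the deployed database and "current" the version of the project being deployed, as above.

```python
from typing import List, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn '1.10.0' into (1, 10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def pending_migrations(target: str, current: str, available: List[str]) -> List[str]:
    """Return the migrations newer than the deployed (target) version
    and not newer than the version being deployed (current)."""
    low, high = parse_version(target), parse_version(current)
    return sorted((v for v in available if low < parse_version(v) <= high),
                  key=parse_version)


print("1.10.0" > "1.9.0")                                # False: plain string comparison
print(parse_version("1.10.0") > parse_version("1.9.0"))  # True: numeric comparison
print(pending_migrations("1.2.0", "1.10.0", ["1.1.0", "1.3.0", "1.10.0", "2.0.0"]))
```

The first two lines of output show why a plain string comparison of version numbers is not enough once a part of the version reaches two digits.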

by Daniel Esser

↧

Agile BI Tools – Implementing Feature Toggles in SSIS / SSDT

June 27, 2014, 9:00 am

Recall the values of the Manifesto for Agile Software Development: individuals and interactions over processes and tools, working software over comprehensive documentation, customer collaboration over contract negotiation, and responding to change over following a plan.

Focusing on working software has led us to a process commonly known as continuous integration. A working CI process gives us continuous feedback about the code quality in our repository. This requires you to integrate your changes as early as possible. Feature branching, for example, is an anti-pattern for the whole CI process because it leads to isolated code areas (isolated from other features or software components) and therefore introduces a bypass to this process. By the way, feature branching is not evil by nature, but it will slow down the whole CI process, so you have to weigh all the pros and cons.

A Feature Toggle is a technique that attempts to provide an alternative to maintaining multiple source code branches. This posting addresses the question how to implement a feature toggle in SQL Server Integration Services (SSIS).

Scope

In SSIS we can have a feature on different scopes, so we have to define the scope for a feature toggle. The term “feature” is a bit squishy. My research turned up the following scopes for an SSIS feature.

Project – In SQL Server 2012, SSIS deployment uses a project-centric approach. Projects have one or more entry packages. These entry packages can be compared to the main() function of an executable. How these entry packages get called is up to a scheduler or a workflow management tool and therefore out of scope of this article.
Packages – Non-entry packages commonly deal with transporting data from A to B, data conversion or aggregation tasks.
Control Flow – Every package has a control flow. This control flow consists of actions. These actions can be executed sequentially or in parallel. Conditional constructs like if-then-else (in a visual fashion) are possible.
Data Flow – Every package consists of zero to n data flows. A data flow is a specific action within a control flow and can transport data from multiple sources to multiple destinations. A key task of a data flow is to extract, transform and load data. So a data flow can consist of multiple streams which get wider, narrower, split or brought together. So far I cannot imagine a situation where a feature toggle would be useful in this context. If the new feature consists of a modified data flow, then we have to copy the whole data flow on the control flow, which leads us to a toggle on the control flow.

Feature Toggle on Packages and Control Flow Scope

Implementing feature toggles in a control flow is fairly obvious because of its similarity to other programming languages. The SSIS analogy to an if-then-else construct is called a precedence constraint (the links between the boxes).

Precedence Constraints

Every precedence constraint can have an expression. This expression gets evaluated during the execution of the package. You can use a project or package parameter to implement a feature toggle. Let’s imagine we have created a package parameter called EnableNewBusinessLogic of type Boolean.

Toggle by constraints

We now can refer to this parameter in the expression on the precedence constraint for the new feature:

Evaluation operation – Expression and constraint
Value – Success
Expression – @[$Package::EnableNewBusinessLogic] == true

We can likewise refer to this parameter in the expression on the precedence constraint for the outdated feature:

Evaluation operation – Expression and constraint
Value – Success
Expression – @[$Package::EnableNewBusinessLogic] == false

Toggle by “Disable Execution” property

You can achieve a similar behaviour by using the disable execution property. Set an expression on the component like this:

Disable - !@[$Package::EnableNewBusinessLogic]

How to implement two (or more) features in the same Data Flow

Imagine there is an old implementation for loading customer data into a database. There is now a request to load some new fields from the CRM. Additionally, at the end of the year the whole company will switch the CRM to a new provider. At the same time a bug in the old code has to be fixed. In the next release only the bug fix should be deployed because the other features need more testing.

You now have two basic approaches to implement these features. The first one uses three branches: one for the extension of the old code, one for the migration to a completely different CRM solution, and one for the bug fix. The problem here is that the bug fix version and the version with the new fields are separated from each other, so there is no integration testing for the two features. The second problem is that when for some reason the bug fix and the version with the new fields have to be released together, a merge of two branches is needed, which is not practicable for SSIS packages.

The second approach uses feature toggles without branching. Implementing toggles by using the disable execution property lets us control which code gets executed at runtime:

When these features need to be implemented in parallel by different developers, separate the data flows into different packages by using the Execute Package Task. All developers have to ensure that any changes they make are distributed to the other features as well (if needed). This task is the counterpart of merging branches, but …

You don’t have to resolve conflicts because the different versions live in the same solution file and branch
You can easily switch which feature version is executed at runtime (by configuration)

How to control the different feature versions for a release?

Visual Studio can handle this with different solution and project configurations. Like a C# project, an SSIS project can have a Debug and a Release configuration with slightly different behaviour. First of all, create two solution configurations. You can achieve this with the configuration manager in the toolbar:

Development should hold the configuration for the current developer (you can also create different development configurations). Release should reflect the configuration as it is needed for a productive environment. Now we create three different project release configurations reflecting the different feature toggles:

In the configuration manager we now can map the solution configuration to the project configuration. In this example we mark the “CustomerNewDataSource” feature as the current productive one:

In the last step we need to add the default configuration values for the specific project configuration. You can do this in the package / project parameter view:

Building the current SSIS project with the Release solution configuration (and therefore the Release_CustomerNewDataSource project configuration) activates this specific feature and disables the other ones.

Metadata on Data Flow Components

During development, data flow components gather information about tables, columns and data types of the sources and destinations used. This information, called metadata, is saved for the runtime of the SSIS package. Before a package is started, this metadata is checked. If, for example, a data source reads from a table named CustomerOrders and this table does not exist or has a different signature (different columns or data types), the validation of the package will fail. This leads us to a problem.

Imagine you have to implement a new feature and at least one data flow is affected by it; the recommended way to implement this feature together with a feature toggle is to copy the old data flow and set an appropriate expression on the precedence constraint of the old and the new code. But what to do if the table signature is different? The package validation fails because the metadata is checked for both the old and the new data flow.

A simple way to work around this limitation is to set the DelayValidation property of the data flow to true. Now validation of the data flow metadata is delayed until the data flow gets executed. Because the feature toggle prevents the execution of the inactive feature, only the data flow of the active feature gets validated.

How to manage which feature will be active?

First you need a release plan. This plan should respect budget, capacity, technical dependencies and deadlines. Now you can group functionalities into features with a fixed delivery date. Your build should reflect this release plan. For example, if the release with version 1.5.0 should contain the features A, B and C, you need a post-build test mechanism to validate that the corresponding feature toggles are set (or not set if the features are not part of this release). You can accomplish this with a simple PowerShell script or an XSL transformation on the SSIS packages. You could also push this test forward to the integration step and directly test (with a simple database test) the package or project parameters in your integration environment.
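The core of such a post-build check is small. The following Python sketch compares the features planned for a release with the toggle values found in a build. How the toggle values are read from the SSIS project is left open here; the dictionaries contain invented sample data and `check_toggles` is a hypothetical helper.

```python
from typing import Dict, List, Set


def check_toggles(release_features: Set[str], toggles: Dict[str, bool]) -> List[str]:
    """Return a list of problems; an empty list means the build matches the plan."""
    problems = []
    for feature in sorted(release_features):
        if not toggles.get(feature, False):
            problems.append(f"{feature}: planned for this release but not enabled")
    for feature, enabled in sorted(toggles.items()):
        if enabled and feature not in release_features:
            problems.append(f"{feature}: enabled but not part of this release")
    return problems


# Hypothetical release plan and toggle values read from the built project
release_plan = {"1.5.0": {"A", "B", "C"}}
built_toggles = {"A": True, "B": True, "C": False, "D": True}

for problem in check_toggles(release_plan["1.5.0"], built_toggles):
    print(problem)
```

A build step could fail the build whenever the returned list is not empty.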

Conclusion

Implementing a feature toggle in SSIS is possible. We delay (not lose) the validation of the metadata on concurrent features (if a data flow is part of the feature). By doing so we can simply work in one single branch and avoid merges between branches; we also reduce the problem we had with hot fixes in other branches that weren’t merged. We can also continuously deliver completed features into production that are not yet enabled, but can be enabled by the product owner when he feels it’s time to enable them. If a bug is found in a newly deployed feature and a critical fix needs to be done, the feature can be turned off in production.

by Daniel Esser

↧

SSAS optimization: The Order of Aggregations

July 5, 2014, 9:37 am

Aggregations play a central role in the user experience with large SSAS cubes. There are many important aspects of an optimal aggregation design. The order in which the aggregations are defined is one of them.

If a query can be answered using more than one aggregation, the first one from the aggregation definition will be taken, not the smallest one!

This is not a problem if a query produces just a few reads, but in the case of complex calculations and large cell spaces with hundreds or even thousands of reads, the overall query performance can be dramatically different. Don’t forget that the fact data not only has to be read from disk, but also aggregated in memory to the requested grain (if it does not match the request).

One more consideration: a certain amount of CPU time is consumed when SSAS searches for the matching aggregation. The share of this time in the whole “give me data” operation (search, read, aggregate) is noticeably bigger for small aggregations. Thus it makes sense to keep small aggregations at the head of the search list.

So optimally the cube aggregations should be sorted by size (ascending).
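The effect of the order can be shown with a strongly simplified model. The Python sketch below treats an aggregation as a set of attributes plus a row count and picks the first aggregation in list order that covers the requested attributes. Real SSAS matching is more involved (attribute relationships, for example); the aggregation names and sizes are invented.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Aggregation(NamedTuple):
    name: str
    attributes: FrozenSet[str]   # granularity attributes the aggregation contains
    rows: int                    # size of the aggregation


def first_match(aggs: List[Aggregation], requested: FrozenSet[str]) -> Optional[Aggregation]:
    """Return the first aggregation in list order that covers the request."""
    for agg in aggs:
        if requested <= agg.attributes:
            return agg
    return None   # no aggregation fits, the fact data has to be read


aggs = [
    Aggregation("Agg_Large", frozenset({"Year", "Month", "Product", "Customer"}), 5_000_000),
    Aggregation("Agg_Small", frozenset({"Year", "Product"}), 2_000),
]
query = frozenset({"Year"})

print(first_match(aggs, query).name)                                # Agg_Large
print(first_match(sorted(aggs, key=lambda a: a.rows), query).name)  # Agg_Small
```

With the original order the query reads 5 million rows, after sorting by size only 2,000, although both aggregations can answer it.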

The info about aggregation size can be viewed using BIDS Helper (Physical Aggregation Sizes) or with the following DMV query:

SELECT * 
FROM SystemRestrictSchema($system.discover_partition_stat
        ,DATABASE_NAME = 'Adventure Works DW 2008'
        ,CUBE_NAME = 'Adventure Works'
        ,MEASURE_GROUP_NAME = 'Internet Sales'
        ,PARTITION_NAME = 'Internet_Sales_2003')

You can also use the DISCOVER XMLA command with RequestType=DISCOVER_PARTITION_STAT.

The aggregations to be reshuffled can be found in the solution in the .partitions file. Alternatively you could do it directly on the deployed cube using XMLA script with ALTER AGGREGATION DESIGN.

Technically our aim is to sort “Aggregation” XML nodes in the .partitions file (or in the ALTER-Script) using the size values. There are several ways to automate this optimization: using PowerShell, C# Script, XSLT (optionally in SSIS package).
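As one possible automation, here is a minimal Python sketch that reorders the Aggregation nodes below every Aggregations node by size. It assumes that each Aggregation node has an ID child element and that the sizes were collected beforehand (for example from the DMV query above) into a dictionary keyed by that ID. The sample XML is a simplified stand-in for a real .partitions file, and the helper names are made up.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def sort_aggregations(root: ET.Element, sizes: Dict[str, int]) -> None:
    """Sort the Aggregation children of every Aggregations node by size, ascending.

    Aggregations without a known size are moved to the end.
    """
    def size_of(agg: ET.Element) -> float:
        for child in agg:
            if local_name(child.tag) == "ID":
                return sizes.get((child.text or "").strip(), float("inf"))
        return float("inf")

    parents = [e for e in root.iter() if local_name(e.tag) == "Aggregations"]
    for parent in parents:
        parent[:] = sorted(parent, key=size_of)


# Simplified stand-in for the aggregation design part of a .partitions file
sample = """<AggregationDesign>
  <Aggregations>
    <Aggregation><ID>Agg_Large</ID></Aggregation>
    <Aggregation><ID>Agg_Small</ID></Aggregation>
    <Aggregation><ID>Agg_Medium</ID></Aggregation>
  </Aggregations>
</AggregationDesign>"""

root = ET.fromstring(sample)
sort_aggregations(root, {"Agg_Large": 5000000, "Agg_Medium": 40000, "Agg_Small": 2000})
print([agg.find("ID").text for agg in root.iter("Aggregation")])
```

The tag comparison ignores XML namespaces on purpose, so the same function also works when the elements carry a namespace.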

Since you may occasionally want to do it manually, here is a straightforward example.

For the sake of simplicity let’s place the size info into the Description elements of the Aggregation nodes. Thus we can sort the XML nodes using information that the nodes themselves contain.

Let’s say we have the following aggregation sizes in BIDS Helper.

You can copy them directly from this dialog to Excel for later use.

Now we need an XML editor which is able to sort nodes and save them with the new order. We will use Stylus Studio (Professional) in this example.

1. Let’s open .partitions in Stylus Studio and navigate in Tree View to the Aggregations node in the Aggregation Design of interest.

Now we can edit Aggregations in “Table view”.

2. Attention! In order to have the Description visible, we need to have at least one non empty description for our aggregations. You can add it in Stylus Studio (“New element”) or do it directly in BIDS:

Now let’s copy-paste the aggregations from Stylus Studio (Copy As Tab-Delimited in Table View) to Excel and merge them with the size info (from BIDS Helper or DMV).

Note that we use a “numeric” form for the percentages, since the values will be sorted as text!

Now copy-paste it back to Table view in Stylus Studio.

Now let’s sort them by Description and press Save.

Note that Stylus Studio puts an <?xml ?> header (XML declaration) into the file. Just delete it manually before using the file.

Now, after updating your cube and running Process Index for the measure group, here is what the aggregation sizes should look like:

Now the small aggregations always take precedence!

Sorting the aggregations by size not only contributes to performance but also makes an aggregation usage review easier when you consider deleting unused large aggregations, which take a lot of processing time and disk space. We recommend making the sorting of aggregations a part of your aggregation maintenance strategy.

by Michael Mukovskiy

↧

SSAS Writeback: Performance Optimization

August 1, 2014, 8:00 am

A common opinion holds that writeback performs well only with a small number of users.

Our experience, however, has taught us otherwise: writeback performance depends above all on the architecture and general performance of the cube, and on how exactly the users work with writeback (how often COMMITs are issued). In fact, all three of these factors can usually be influenced. Even regarding working habits, users can be advised to collect more UPDATEs under a single COMMIT wherever possible.

We leave out of scope here the writeback scenarios at higher cube levels, which inevitably lead to large granular allocations. In such cases it is only logical that substantial resources are needed to distribute the values: first in the session cache on UPDATE, then on COMMIT for the bulk insert into the table of the writeback partition, and finally for the updates in other sessions.

For writebacks close to the fact granularity (leaves), this overhead is in itself rather modest.

The main problem with writeback is that a COMMIT can technically be regarded as cube processing, with the usual problems of processing: locks on the AS database, cache losses and cancellation of long-running queries. None of these downsides stems from some “suboptimal design” of Analysis Services; they are a normal trade-off in any system in which read and write operations occur at the same time.

Since the number of COMMITs usually rises when several writeback users work in parallel, the general cube performance drops noticeably. This is also reflected in user acceptance.

Below we look at a standard situation in which a COMMIT arrives at the same AS database while a long MDX query is running:

The COMMIT waits up to 30 seconds (default setting) for long-running queries. Within this time window the entire AS database is locked for all other activity.

This affects not only queries but also UPDATEs, which are otherwise fast but are perceived here as part of a “slow writeback”. The fact that the long-running queries are canceled after 30 seconds adds a further negative experience for users.

And this is exactly what we observe: long-running queries against the same database are what usually cause the problems with writeback.

Another downside: between UPDATE and COMMIT, the queries of writeback users run just as slowly as with a “cold” cache, because the “normal” cube data now has to be merged with the writeback deltas from the session cache!

So the main measures are: optimize the cube and the queries for performance so that queries run for at most a few seconds even with a “cold” cache, and, where possible, split the cube(s) into separate AS databases to avoid the broad locks.

For diagnosing problems it is normally enough to watch the AS and SQL instances in Profiler.

One more hint: make sure that the relational writeback tables are not too fragmented by INSERTs and DELETEs, which can waste resources unnecessarily on simple operations. A regular TRUNCATE TABLE does no harm here.

by Michael Mukovskiy

↧

Migration of SSIS 2008 Package Connections to SSIS 2012 Project Connections

August 3, 2014, 12:23 pm

When migrating an existing Data Warehouse from SSIS 2008 to SSIS 2012 you might want to use the Project Deployment Model instead of the Package Deployment Model, because it provides the possibility to use Project Connection Managers which can be used by all SSIS packages in your project. This is very useful when your packages all need to connect to the same databases, because then you only have to configure every Connection Manager once and this configuration is valid for all packages in your project.

The easiest way to do this seems to be to open an SSIS 2008 package in SSDT 2010 (or a newer version), right-click the Package Connection Manager, choose “Convert to Project Connection”, and repeat this for every Connection Manager.

This works fine, but only for the first package in the project. As soon as you want to migrate another SSIS 2008 package (which I guess is the usual case) and reuse the Project Connection Managers that you created while migrating the first package, you can no longer use the “Convert to Project Connection” functionality: when a Project Connection Manager with the same name already exists, SSDT creates a new Project Connection Manager that you would have to configure separately.

The next idea could be to delete the Package Connection Managers, because then the package can access the Project Connection Managers. But this invalidates all your Control Flow Tasks and Data Flow Components, because they all lose their connection, so you have to fix every Task and every Component manually. For a small package with simple functionality and only a few Tasks this might be acceptable. But in packages that use many Connection Managers and dozens of Tasks or Components, changing every item takes a lot of effort. And it can become very tricky when you are not the original developer of the package. Just imagine that you have to choose the proper connection for every Data Flow Component in a package that looks like this (or is even more complex):

To solve this problem, remember that SSIS packages are XML files, so you can open and edit them with a text editor such as Notepad++ or UltraEdit.

For every Source and Destination component you will find an XML tag like this:

Here you need to replace the connectionManagerID with the ID of the appropriate Project Connection Manager followed by the extra text “:external”. In the connectionManagerRefId you need to write “Project” instead of “Package”.

Repeat this for every Source and Destination in your package.
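For packages with many components, the replacement described above could be scripted instead of done by hand. The following Python sketch is a hedged illustration, not a tested migration tool: the attribute names connectionManagerID and connectionManagerRefId come from the text, while the sample XML fragment, the connection name "DWH", the all-zero ID and the helper name `to_project_connection` are made up. It also assumes that both attributes currently hold a value of the form Package.ConnectionManagers[name]; check this against your own package first.

```python
def to_project_connection(package_xml: str, name: str, project_id: str) -> str:
    """Point all references to the package connection `name` at the project connection."""
    package_ref = f"Package.ConnectionManagers[{name}]"
    project_ref = f"Project.ConnectionManagers[{name}]"
    # connectionManagerID: ID of the Project Connection Manager plus ":external".
    package_xml = package_xml.replace(
        f'connectionManagerID="{package_ref}"',
        f'connectionManagerID="{project_id}:external"',
    )
    # connectionManagerRefId: "Project" instead of "Package".
    return package_xml.replace(
        f'connectionManagerRefId="{package_ref}"',
        f'connectionManagerRefId="{project_ref}"',
    )


# Made-up fragment and ID, for illustration only.
sample = (
    '<connection name="OleDbConnection" '
    'connectionManagerID="Package.ConnectionManagers[DWH]" '
    'connectionManagerRefId="Package.ConnectionManagers[DWH]" />'
)
print(to_project_connection(sample, "DWH", "{00000000-0000-0000-0000-000000000000}"))
```

Plain string replacement is used on purpose so that the rest of the file stays byte-for-byte unchanged; work on a copy of the .dtsx file in any case.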

For every Execute SQL Task in your Control Flow you can do something similar. Here you have to change the SQLTask:Connection Property.

Replace it with the ID of the Project Connection Manager:

Next, save the package and then delete the Package Connection Managers that you replaced with Project Connection Managers. You can either use SSDT (in this case ignore the error message stating that the Project Connection Manager cannot be accessed because a Package Connection Manager of the same name also exists) or edit the XML directly and save the package. SSDT should now be able to load the project without any errors. Now try to execute the package:

Et voilà, the package is runnable and uses the Project Connection Managers.

There is still one open question: How do I get the ID of the new Project Connection Managers? Again consider that SSIS packages are XML-files, so simply open the Connection Manager files (*.connmgr) in a text editor. Here you can find the ID that you have to use:
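Looking up the ID can be scripted as well. The sketch below is an assumption-laden illustration: it assumes that the ID is stored in a DTSID attribute in the DTS namespace on the root element of the .connmgr file, which is how SSIS 2012 files usually look but is not stated in this post. The sample document, the connection name and the all-zero ID are made up, and `connection_manager_id` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI and attribute name as they usually appear in
# SSIS 2012 files; check them against your own .connmgr file.
DTS_NAMESPACE = "www.microsoft.com/SqlServer/Dts"


def connection_manager_id(connmgr_xml: str) -> str:
    """Return the DTSID attribute of the root element of a .connmgr document."""
    root = ET.fromstring(connmgr_xml)
    return root.attrib[f"{{{DTS_NAMESPACE}}}DTSID"]


# Made-up document, for illustration only.
sample = (
    '<DTS:ConnectionManager xmlns:DTS="www.microsoft.com/SqlServer/Dts" '
    'DTS:ObjectName="DWH" '
    'DTS:DTSID="{00000000-0000-0000-0000-000000000000}" />'
)
print(connection_manager_id(sample))
```

For a file on disk, `ET.parse(path).getroot()` could be used in place of `ET.fromstring`.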

One last remark: this is of course only one way to migrate Package Connection Managers to Project Connection Managers, but it is quite simple and does not cause extra costs such as license fees for tools. So, depending on the size of your project, this could be a good alternative to purchasing external tools.

by Thomas Rahier

↧

The ‘KeyColumns’ #0 has NullProcessing set to ‘UnknownMember’, but the dimension doesn’t have UnknownMember set to ‘Visible’ or ‘Hidden’

August 10, 2014, 4:27 am

SQL Server 2005-2014

By default, SSAS provides an automatically created member named ‘unknown’ for each dimension. This member is intended to be the home for all facts that don’t map to a real member (provided by the data source). In the example above, fact data that does not match any of the listed product categories could be mapped to the unknown element.

I’m saying ‘could’ and not ‘is’ because the rules for mapping fact data to the unknown-element can be configured in the dimension properties.

But using this mechanism has certain drawbacks:

Processing time increases a lot as soon as a row is encountered that has to be mapped to unknown
Only one caption (for example ‘unknown’) can be configured for all attributes in the dimension
Such cases are hard to find since you don’t see this mapping in the underlying data warehouse tables

In a good data warehouse design, ETL takes care of the correct mapping of fact data to its dimensions by using surrogate keys. Each join is then an inner join. In order to do so, dimension tables usually contain a row for the unknown element. Frequently, the surrogate key –1 is used for this row.
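The surrogate key lookup described above can be sketched in a few lines. This Python example only illustrates the idea and is not how any particular ETL tool implements it: the column names, the sample data and the helper `assign_product_keys` are hypothetical, and -1 is used for the unknown row as mentioned in the text.

```python
UNKNOWN_KEY = -1  # surrogate key of the 'unknown' row in the dimension table


def assign_product_keys(fact_rows, product_keys):
    """Replace the business key of each fact row with a surrogate key.

    Rows whose business key is missing from the dimension get UNKNOWN_KEY,
    so a later inner join to the dimension never drops a fact row.
    """
    return [
        {"product_key": product_keys.get(row["product_code"], UNKNOWN_KEY),
         "amount": row["amount"]}
        for row in fact_rows
    ]


# Hypothetical sample data.
product_keys = {"BIKE": 1, "HELMET": 2}
facts = [
    {"product_code": "BIKE", "amount": 100},
    {"product_code": "SKATES", "amount": 40},  # not in the dimension
    {"product_code": None, "amount": 5},       # NULL business key
]
print(assign_product_keys(facts, product_keys))
```

Because the mapping to -1 happens in ETL, the unmatched rows stay visible in the data warehouse tables and can be queried there.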

But following this best practice results in the SSAS dimension showing two elements for ‘unknown’: The dimension entry and the automatically created entry.

So, why does an SSAS dimension have this built-in unknown element by default? If we build almost all SSAS cubes on a good data warehouse design where the dimensions maintain their own unknown element, there is no need for an automatically created unknown element anymore. But since the SSAS cube wizard is intended to work with most types of data structures, the unknown element is there by default. Without ETL-enforced surrogate keys you just cannot be sure that every fact row maps to its dimensions.

So, as explained above, we want to remove this default unknown element in almost all SSAS cube development projects. This can be easily done in the properties dialog of the dimension:

There are four available options for the unknown member:

visible	The unknown-member of the dimension exists and is visible
hidden	The unknown-member of the dimension exists and is hidden
none	The unknown-member of the dimension does not exist
automatic null	The unknown-member of the dimension exists and is visible if there are violations of referential integrity (fact keys not found in the dimension).

Again, if we take care of the surrogate keys in the ETL process, there is no need for a dimension unknown element at all. So the best option is to disable it (UnknownMember set to none).

However, because of other default settings, you’re getting the following error when trying to deploy your SSAS model afterwards:

The ‘KeyColumns’ #0 has NullProcessing set to ‘UnknownMember’, but the dimension doesn’t have UnknownMember set to ‘Visible’ or ‘Hidden’

If you’re getting this error for the first time, it might not be clear where to fix it, especially since two changes need to be made:

1. Adjusting the dimension key attribute

If you look at the dimension, you’ll notice the red hash-line below the key attribute of your dimension.

In order to fix this, you’ll need to open the properties of that attribute. Now navigate to the key columns setting and expand the view for each of the columns (since you may have more than one column binding for the attribute’s key) as shown in the following screenshot:

Here you can set the NullProcessing option to “Error”. The default is “UnknownMember”, but since we just disabled the unknown member, this default causes the error.

Remember to do this for each key column of this attribute.

2. Adjust null processing in the dimension usage.

The second place to modify is the dimension mapping. To do so, open the cube and go to the dimension usage tab. You’ll notice the red hash-line at the attribute (in my example, the Product ID):

In order to fix this, click the button next to the attribute (the one labeled “-”) to open this dialog:

Click advanced to edit the properties of the mapping:

In the lower part of the dialog, you can set the “Null Processing” from “UnknownMember” to “Error”.

After this change you should be able to deploy the cube again.

by Hilmar Buchta

↧

SSAS: Writeback Performance

August 12, 2014, 2:28 am

We often hear that writeback can only be used with a very small number of users; otherwise it is too slow.

Our experience tells us instead that writeback performance depends first of all on query performance (SELECTs) and on writeback usage patterns (the frequency of COMMITs). Both can actually be influenced. Query performance should always be monitored and optimized, and as for usage patterns, we can recommend that users group more UPDATEs under one COMMIT.

Here we leave out of scope writebacks on high levels, which lead to a large distribution on the leaves. In this case Analysis Services needs substantial resources for changes in the session cache (UPDATE), then for BULK INSERTs into the writeback table and for the recalculation of other sessions’ writeback caches (COMMIT).

For writeback on leaves (or not far from leaves), Analysis Services does not really have much work to do.

From the performance point of view, writeback can be considered a scenario with frequent cube processing and its usual problems: database locks, an emptied cache and canceled queries. This is not some kind of suboptimal design of Analysis Services, but just a normal trade-off for systems with SELECTs and UPDATEs at the same time.

More writeback users issue COMMITs more often, which degrades the overall cube performance.

Let us consider a situation where a COMMIT starts while a long SELECT is running against the same AS database:

The COMMIT waits at most 30 seconds (default) for the long-running query, holding a lock on the whole AS database.

Not only new SELECTs are blocked but also other UPDATEs, giving other writeback users the impression of “slow writeback”! Moreover, the cancellation of long-running queries (after 30 seconds by default) adds to the negative user experience.

And this is what we usually see: long-running queries against the same AS database are the main cause of “poor writeback performance”.

Another point: between the first UPDATE and the COMMIT, the queries of writeback users behave as with a “cold cache” because the session deltas have to be added to the results!

So the two main points for optimization are: design your cube and queries to run for only a few seconds even with a “cold cache”, and separate the cube(s) into several databases to restrict the locks.

Normally it is sufficient to diagnose the problem by watching the AS and SQL instances with SQL Server Profiler.

And yet another hint: frequent INSERTs and DELETEs in writeback tables can cause substantial table fragmentation, which slows down the relational part and, as a result, the whole writeback chain. It is not a bad idea to run a TRUNCATE from time to time.

by Michael Mukovskiy

↧

OLAP for Business: A True Fairy Tale

August 22, 2014, 5:04 am

Late yesterday we received a call. The customer complained that the SSAS cube, which was developed ten years ago, had recently been getting slower and slower.

It quickly turned out that the problem lay in the partitioning. To keep costs to a minimum, static partitioning had been used. Back then, individual partitions were created only for the first 5 years. So that later years would also end up in the cube, the last partition was created with an open-ended WHERE condition. By now that last partition has collected 5 years of data and has grown well beyond 20 million rows (the maximum recommended size).
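To make the idea of static partitions with an open-ended last partition concrete, here is a small Python sketch. It is an illustration under assumptions only: the column name OrderYear, the years and the helper `partition_filters` are hypothetical and not taken from the customer system.

```python
def partition_filters(first_year: int, last_year: int, column: str = "OrderYear"):
    """Return one WHERE condition per yearly partition; the last one is open-ended."""
    filters = [f"{column} = {year}" for year in range(first_year, last_year)]
    filters.append(f"{column} >= {last_year}")  # catches all later years
    return filters


for condition in partition_filters(2004, 2008):
    print(condition)
```

The last condition keeps new data flowing into the cube without any maintenance, which is exactly why that partition keeps growing once the planned years are over.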

Interestingly, the cube has been running for 10 years under SSAS 2000 on Windows NT. It is updated daily by DTS packages. And in those 10 years the cube, although used intensively, has received no maintenance whatsoever. The system, which was built for EUR 25,000 at the time, freed two employees, who previously did nothing but create reports manually in Excel, for other work and has thereby already saved a small company more than one million EUR (sic!).

In five minutes we created the partitions for the next 30 years. With an open end, of course.

by Michael Mukovskiy

↧

Performance optimizations when loading many small files in SSIS

September 14, 2014, 4:36 am

SQL Server Integration Services (SSIS)

In general, reading text files with SQL Server Integration Services (SSIS) is not a complicated task. The flat file source offers a user-friendly interface to deal with separators, header lines and code pages/Unicode. It can even determine the best data type for each column by scanning sample rows from the text file. And if your file is in XML format, you can use the XML source component in SSIS to read the file’s contents.

In many cases, however, you will not have just a single file but a directory containing many files. For this case SSIS offers the for-each loop container, which creates a loop over all those files. The for-each loop container also has a user-friendly interface, so you can easily solve this task in SSIS as well.

In this post I’d like to discuss a scenario with many small files to import, and I will compare the for-each loop approach with a single data flow approach. The task was to read about 2,100 files in JSON format into a single SQL Server database table. Each file is up to 80 KB in size and contains from 1 to about 350 rows of data, with an average of about 310 rows per file. So, in total I had to import about 650,000 rows of data from about 2,100 files. Doesn’t sound like a big deal so far.

First I created a for-each container with a single data flow:

The for-each container’s type is set to a “Foreach File Enumerator” scanning all files from a given directory:

The data flow simply reads the current file (I’m not going into details about the JSON-file here but some library like JSON.Net will do), does some minor changes (derived column) and writes the results into a SQL Server database table:

Running the package imported all of the files into my data table BUT… it took much longer than expected. In fact, it took 8 minutes. So whatever my expectation was before running the package, this was way too slow. First I checked whether I had made any mistakes. The OLE DB destination was set to use a fast table load with a table lock and without checking constraints, so this was OK. The destination table was a heap with no primary key, so index reorganization wasn’t a problem either.

Checking the progress log revealed that the validation, pre-execute and post-execute events are executed for each file. And since each file contained only a few rows, very small batches were committed in each loop iteration, causing the bad performance. Also, sending small batches to a table may be a bad idea depending on your table geometry. For example, when using clustered columnstore index tables, sending small batches results in asynchronous compression cycles as explained here.

If the flat files are actually CSV files, the best approach is to use the MULTIFLATFILE Connection Manager. I must admit that I wasn’t aware it existed until a colleague showed it to me. So here are the instructions to find this connection manager: when you right-click in the connection section of your package, a dialog appears to choose the connection type. Click on “New Connection…” here.

In the following dialog you can choose the MULTIFLATFILE Connection Manager. It is configured in exactly the same way as the standard flat file connection manager, but now you can specify multiple files or directories to scan.

But since my source files were not in CSV format, I had to go for a different approach here. I replaced the for-each loop container with C# code inside the script component from above. Here is the corresponding code I used:

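// Requires "using System.IO;" at the top of the script for Directory.
public override void CreateNewOutputRows()
{
    // Import folder; use a package variable instead of a constant in a real package.
    String path = "c:\\temp\\JSON_Import";
    foreach (String filename in Directory.EnumerateFiles(path))
    {
        // process single file
        …
    }
}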

Again, I’m not going into details about the actual code for importing the JSON file here, but the code above shows how simply a for-each loop can be implemented within a script (of course you will want to add some error handling and use a package variable for the import folder instead of the constant string here). The remaining parts of the data flow were left unchanged.
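The same idea, one continuous row stream fed from many small files, can be sketched outside SSIS. The Python example below is a language-neutral illustration, not the code used in the package: it assumes that each file holds a JSON array of row objects (the post does not describe the file layout), and `read_rows` and `batches` are hypothetical helper names.

```python
import json
from pathlib import Path


def read_rows(folder):
    """Yield the rows of all *.json files in `folder` as one continuous stream."""
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            # Assumption: each file holds a JSON array of row objects.
            yield from json.load(handle)


def batches(rows, size):
    """Group a row stream into lists of at most `size` rows for bulk inserts."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Because the batch size no longer depends on the size of a single file, the destination receives a few large batches instead of one tiny batch per file, for example via `batches(read_rows(folder), 10000)`.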

This time, importing all of the 2,100 files took 10 seconds, so this approach was about 48 times faster than the for-each loop container (8 minutes are 480 seconds, and 480 / 10 = 48).

Conclusion

In SSIS, writing data to a database table using a data flow gives the best performance if you have a large number of rows. However, importing many small files from a directory using the for-each loop container results in the opposite: many inserts with just a few rows each. If you encounter performance degradation in such a scenario, using the MULTIFLATFILE connection manager or, if that is not possible, moving the for-each loop and the file read operation itself into a script component may result in much better performance. To improve performance even more, you could also try to parallelize the scripts (for example, the first script importing files 1, 3, 5, … and the second one importing files 2, 4, 6, …).
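The alternating split of files between two scripts mentioned above can be expressed with simple slicing. This Python sketch only illustrates the partitioning scheme; the file names and the helper `interleave` are hypothetical, and whether the parallel import is actually faster would have to be measured.

```python
def interleave(files, workers):
    """Split `files` into `workers` lists: worker 0 gets items 0, n, 2n, ..."""
    return [files[start::workers] for start in range(workers)]


files = [f"file{number}.json" for number in range(1, 8)]  # hypothetical names
first, second = interleave(files, 2)
print(first)   # files 1, 3, 5, 7
print(second)  # files 2, 4, 6
```

Each file lands in exactly one list, so the scripts never import the same file twice.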

by Hilmar Buchta

↧