Wednesday, December 14, 2011

SQL Saturday #119 – Chicago

SQL Saturday #119 is scheduled for May 19, 2012 and will be held at DeVry University in Addison, IL. This is a free event, but there is a $10 charge for lunch. Sign up soon; last year it was “sold out” weeks before the actual date.

This is a good way to learn new tricks with the Microsoft BI tools and SQL Server in general, and also to network with other folks (Microsoft employees, MVPs, PASS organizers, authors of SQL books, and some really smart people who enjoy sharing what they know).

I’m signed up and would encourage you to consider attending.

Thursday, November 10, 2011

Columnstore Index In the Wild - A First Look At a Columnstore Index In SQL Server 2012

I had a chance to do a proof of concept project with a client who was looking to migrate from an unsupported version of Sybase IQ to SQL Server 2012. The client was interested in keeping the database as similar as possible and then pointing the existing BI reporting applications to the new SQL Server. The client was also interested in proving that SQL Server can perform as well as Sybase IQ. Remembering that Sybase IQ is one of the leading high-end column-oriented databases, storing data in columns rather than rows, this seemed like a bit of a challenge. I thought this would be a nice chance to take a real world look at SQL Server 2012's new columnstore index.

There is a comprehensive article by Eric N. Hanson about the requirements and things to consider when implementing a columnstore index. I suggest reading this article before you get started so you can get an idea of the memory requirements and make an informed decision on the number and length of columns your server can support.

I wanted to give some information on the server and data characteristics so you can compare this to your environment. The server used for the proof of concept was a virtualized server running Windows 2008 R2 (64-bit) with 2 cores and 8 GB of memory, and SQL Server was set up with a 7 GB memory cap. The main fact table used for the proof of concept was loaded with 25 million rows and contained almost 100 fields. The size of the table came out to about 20 GB.

Creating the columnstore index can be done in SQL Server Management Studio by expanding any table and right-clicking its Indexes folder. You will now have the option to create a traditional B-tree index as well as the new columnstore index, so this should be as familiar as creating any other index. The index can be saved off as a script and executed later.

A few things I came across are worth mentioning. First, a table can only have one columnstore index. This is covered in all the documentation, but the implication is that you will need to put some thought into which columns to include, since putting every column in the columnstore index is not always possible given the memory requirement. This is also covered in Eric's article, which gives the formula you can use to calculate the memory required.

Second, INSERT, UPDATE, and DELETE statements are prohibited on a table once it has a columnstore index. The columnstore index must first be DISABLED, and after you are finished updating the table the index has to be REBUILT. Additionally, ALTER TABLE statements are not allowed on columns included in the columnstore index until you DISABLE the index. This seemed logical after thinking about it, but I was surprised when I got the error message since this is a difference from a traditional B-tree index.

Third, you can verify whether a query is using a columnstore index by looking at the execution plan in SQL Server Management Studio. Again, this technique is no different than with a traditional index.
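As a quick sanity check, here is a sketch of the kind of aggregate query used in the tests, using two columns from the fact table (the index script later in this post lists them). With "Include Actual Execution Plan" enabled in SSMS, a Columnstore Index Scan operator should appear in the plan when the optimizer chooses the columnstore index.

--Look for a Columnstore Index Scan operator in the actual execution plan
SELECT [year], SUM([net_sale_amt]) AS net_sales
FROM [POC].[invoice_line]
GROUP BY [year]
ORDER BY [year]
GO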

Fourth, as expected, the columnstore index will only be available in Enterprise Edition.
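If you are not sure which edition a target server is running, the standard SERVERPROPERTY function gives a quick answer:

SELECT SERVERPROPERTY('Edition') AS Edition,
       SERVERPROPERTY('EngineEdition') AS EngineEdition,  --3 = Enterprise class (includes Developer/Evaluation)
       SERVERPROPERTY('ProductVersion') AS ProductVersion
GO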

It took about 5 minutes to rebuild the columnstore index, which seemed reasonable when considering the table size I was working with.

The SQL below was used to create the columnstore index. Every column that was used in any of the test reports was included in the columnstore index.

CREATE NONCLUSTERED COLUMNSTORE INDEX [invoice_line_IDX_CS] ON [POC].[invoice_line]
(
    [year],
    [period],
    [item_no],
    [cust_no],
    [invoice_no],
    [line_no],
    [charge_cust_no],
    [srep_code],
    [quantity],
    [selling_uom],
    [ship_name],
    [ship_addr1],
    [ship_addr2],
    [ship_to_city],
    [ship_to_state],
    [ship_to_zip],
    [invoice_date],
    [order_type],
    [record_updated],
    [total_pkg_qty_per_sku],
    [total_product_quantity],
    [net_sale_amt]
) WITH (DROP_EXISTING = OFF) ON [PRIMARY]
GO

Two Execute SQL Tasks were added to the SSIS package that loads the main fact table: one just before the load to disable the columnstore index, and one just after the load to rebuild it.

ALTER INDEX [invoice_line_IDX_CS] ON [POC].[invoice_line] DISABLE
GO

--LOAD TABLE USING SSIS

ALTER INDEX [invoice_line_IDX_CS] ON [POC].[invoice_line] REBUILD
GO

Four basic reports that represented common user requests in the environment were used to compare report performance across Sybase IQ, SQL Server with a traditional B-tree index, and SQL Server with a columnstore index. The reports were executed using two different BI reporting tools, represented in the chart below as "A" and "B". The time represents the total report execution time in seconds, which includes data retrieval and internal report processing by the BI reporting tool.

The reports used simple SQL; nothing fancy or complex here. Report 1 simply summed invoice amount by all years, while Report 2 filtered on a single sales zone and then summed invoice amount by year. Report 3 was similar to Report 1 (summing invoice amounts by all years) but was based on a view that included extra columns. Report 4 was also based on the same view and was similar to Report 2 (filtering on a single sales zone and then summing invoice amount by year).
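As a rough sketch only (the actual SQL was generated by the BI tools, and the sales zone column below is an assumption since it is not part of the index listed above), Reports 1 and 2 looked something like this:

--Report 1: total invoice amount by year
SELECT [year], SUM([net_sale_amt]) AS total_invoice_amt
FROM [POC].[invoice_line]
GROUP BY [year]
ORDER BY [year]

--Report 2: the same aggregate, filtered to a single sales zone
--([sales_zone] is a stand-in for the actual column the reports filtered on)
SELECT [year], SUM([net_sale_amt]) AS total_invoice_amt
FROM [POC].[invoice_line]
WHERE [sales_zone] = 'EAST'
GROUP BY [year]
ORDER BY [year]
GO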

[Chart: total report execution time in seconds for Reports 1–4, comparing Sybase IQ, SQL Server with a B-tree index, and SQL Server with a columnstore index, for BI tools A and B]

Obviously, SQL Server running on the newer 64-bit hardware outperformed Sybase IQ running on the older 32-bit hardware.

In most cases the columnstore index improved query time and was impressively fast, but it was not always faster than the B-tree index. Report 1b was actually slower and Report 2b was the same. This may have more to do with the processing time of the BI tool than the data retrieval time, but a deeper look is needed to understand this observation. Overall, I was pleased with the columnstore index as it was easy to set up and offered a noticeable performance improvement for most report users.

A columnstore index will help reports that summarize data at an aggregate level (SUM, MIN, MAX, AVG), while a traditional B-tree index will still have its place for retrieving a few rows using highly selective filters, as that is not really the strength of a columnstore index; the sketch below makes the distinction concrete. This leads me to ask, "As a BI practitioner, where does it make sense to use a columnstore index?"
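Here is a hedged sketch of the two query shapes against the proof of concept table: the first is the scan-and-aggregate pattern a columnstore index is built for, while the second is a highly selective lookup where a B-tree index on [invoice_no] would normally win (the literal value is made up for illustration).

--Columnstore-friendly: scans millions of rows and aggregates a few columns
SELECT [year], [period],
       SUM([net_sale_amt]) AS net_sales,
       SUM([total_product_quantity]) AS total_qty
FROM [POC].[invoice_line]
GROUP BY [year], [period]

--B-tree-friendly: highly selective filter returning a handful of rows
--(assumes [invoice_no] is numeric; adjust the literal to match the data type)
SELECT [invoice_no], [line_no], [item_no], [quantity], [net_sale_amt]
FROM [POC].[invoice_line]
WHERE [invoice_no] = 123456789
GO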

Speeding up existing reports may be one idea. Some reports that may not have been considered online reports can now be deployed in an interactive, online way. I would think that if this were truly the case you would have already created an SSAS cube. Interestingly enough, I did have a project where I had to create an SSAS cube using HR data for this very reason. The report needed to run embedded in a .Net web application with a 1-2 second response time. After spending time performance tuning and reindexing, I turned to creating an SSAS cube, which ended up giving under 2 second response times for the report queries. In a case like this, a columnstore index may have been something to consider since it would have avoided having to create and process an SSAS cube.

A second idea was that a columnstore index could be used as a replacement for aggregate tables. Back in 1999, aggregate tables seemed to be more common; nowadays I honestly don't remember having a need for them on any recent project. Today I would tend to rely on an SSAS or a Cognos cube instead, so I am not sure about this idea.

I also thought up a third idea when thinking back on a previous client, where I was working on a medium size SSAS cube that contained quality data. The company had many poorly performing SSRS reports that used MDX against an SSAS cube. The company's IT staff was not comfortable with MDX or SSAS, but did have strong skills in SQL and SQL Server. A company like this may be an ideal candidate for a columnstore index. It can support their SSRS reports and take advantage of their strengths in SQL and SQL Server. I would think there are cases where a columnstore index may meet the reporting needs and avoid the complexities that come with MDX and SSAS.

I would be curious to hear if you have any other ideas for potential applications of the columnstore index.

Friday, October 14, 2011

SQL PASS 2011 Summit in Seattle, WA

I wanted to share my top ten list of the topics I felt were significant and received a lot of attention at the SQL PASS 2011 Summit. It was reported that this year's SQL PASS Summit was the largest ever, with 5,000+ reported participants. Personally, it is always fun to chat with other SQL Server folks and the SQL Server celebrities (i.e. authors, bloggers, the SQL Server product development team, and the SQLCAT team).

Without any other formalities, here is my top ten list of significant topics at the  SQL PASS 2011 Summit:

Number 10  The BI Semantic Model addresses something the Microsoft BI stack often got discounted for not having when companies did product evaluations against Business Objects, Cognos, and MicroStrategy, or in reviews by Gartner. The BI Semantic Model (BISM) will expose complex data structures in a simpler way, using business-friendly terms, and will become the foundation for self-service and ad hoc reporting by non-IT staff.

Number 9   Hadoop connectors and Linux ODBC drivers will be available. These types of connectors surround those technologies to provide a gateway to the Microsoft BI stack. The Linux ODBC drivers will be useful when doing migrations from databases on Linux to SQL Server.

Number 8   PowerShell is the scripting language for administering Windows, SQL Server, and SharePoint. There was a lot of buzz around learning PowerShell to do setup and administration. Many of the setup tools actually use PowerShell in the backend and provide the script before it is executed. This is a nice feature, as the script can be saved to a file for reuse later or serve as a good way to learn PowerShell.

Number 7   Change Data Capture extended to Oracle sources was announced for SQL Server 2012.

Number 6   Data Mining - Noticeably missing in action was any update on the Data Mining Add-in for Excel 2010 in the SQL Server 2012 release. Currently, the Data Mining Excel Add-in is only officially supported for Excel 2007. I was told by a SQL MVP that the SSAS team was focused on delivering PowerPivot v2 and Power View, but that updating the Data Mining Add-in for Excel 2010 is unofficially planned within the next year. There was a lot of buzz about using DMX queries to visualize data mining output in SSRS and to bring data mining insights into SSIS or online web applications. This is a big positive; the message to me was that the pieces are in place to deliver data mining insights to end users. I also had the chance to get familiar with a company called Predixion (aka the old Microsoft data mining team). They had an interesting cloud-based analytics tool that used Excel 2010. It looked promising and appeared to be tailored for small to medium sized companies.

Number 5  SQL Server Reporting Services (SSRS) will support data-driven alerts created by end users. This is a first version, so formatting is somewhat limited, but the idea is that users can create their own alerts to be emailed to them when data conditions exist that require immediate action. Additionally, SSRS will be available as a service application in SharePoint, which brings integrated mode performance on par with native mode and simplifies setup. SSRS will still be available in native mode, but integrated mode will be required for the new alerting feature.

Number 4  There was a lot of buzz about “AlwaysOn”. This is a high availability (HA) feature that uses something known as Availability Groups (AGs), which combine the best of SQL Server clustering (a single virtual IP) and database mirroring without requiring shared storage. AGs support off-site replication. This should make HA easier to deploy and reduce hardware dependencies.

Number 3   Improvements in information retrieval over unstructured documents (PDF, Word, and Excel) in the SQL Server 2012 database engine. There are performance improvements in Full-Text Search, but the two more significant new features are Semantic Search and FileTable. Semantic Search gives the ability to find documents that are similar to a given document. FileTable is a new object in SQL Server 2012 that uses FILESTREAM functionality to query the files in a given directory; a FileTable is updated immediately as soon as files are copied to the directory it is configured for (a rough FileTable sketch follows at the end of this top ten list). Semantic Search is very exciting to me.

Number 2   The columnstore index is a database structure that stores data using VertiPaq, which allows data to be stored in a more compressed format to reduce physical IOs. Historically, column store technology was a high-end feature available in products like Teradata and Sybase IQ. The columnstore index will significantly improve performance of queries that scan many rows, such as queries that scan large fact tables. It was reported that queries taking minutes can now take seconds.

Number 1   Power View (formerly known as Project Crescent) is the ad hoc reporting application presented through a browser via Silverlight. Microsoft made a huge investment in this technology, as it is seen as Microsoft's answer for mobile BI. Power View will allow users to create reports against the BI Semantic Model. The reports are very interactive and support self-service ad hoc reporting. It also supports animating charts along a time dimension to visually show trends changing over time. Power View is very visual and very eye catching. All in all, Power View was the buzz of PASS 2011. It carries a huge WOW factor that end users will be pleased with. With any good thing there is a catch, and the catch here is that implementing Power View in most environments is not a trivial task. It involves multiple technologies, security topologies, and communication protocols. The challenge will be to understand SharePoint 2010, PowerPivot, SQL Server Reporting Services, Excel Services, SQL Server Analysis Services, the BI Semantic Model, Silverlight, and the various security architectures (classic Windows authentication and claims authentication) to successfully implement this solution in medium to larger environments.
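As mentioned under Number 3, here is a rough sketch of the FileTable idea. The table name and directory are made up, and this assumes FILESTREAM is already enabled with a FILESTREAM filegroup and non-transacted access configured for the database:

--Create a FileTable; files copied into its Windows share appear as rows immediately
CREATE TABLE dbo.CorpDocuments AS FILETABLE
    WITH (FILETABLE_DIRECTORY = 'CorpDocuments')
GO

--Query the documents like any other table
SELECT [name], file_type, cached_file_size, creation_time
FROM dbo.CorpDocuments
WHERE is_directory = 0
GO

Semantic Search then builds on a full-text index over documents like these and exposes functions such as SEMANTICSIMILARITYTABLE to find the documents most similar to a given one.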

Saturday, June 11, 2011

Using the SharePoint 2010 User Profiles to Implement Dimensional Level Security in SQL Server Analysis Services 2008

 

Building security into a SQL Server Analysis Services (SSAS) cube to limit a user's view of the data so they only see “their own” data is a typical client requirement. For example, a manufacturing company may want the sales representatives in the East region to see only East region data, while sales representatives in the West region should only see West region data. Senior managers and the accounting staff will need to see all data for all regions. SSAS can support this functionality using the techniques described in Implementing User-Specific Security in SSAS by Bryan C. Smith and Dynamic Security in SSAS Cubes by Azaz Rasool.

SSAS relies on the Windows ID to identify the user; the Username function returns “MyDomain\UserId”. Often the user ID in the source application or ERP is not synchronized with the Windows Active Directory ID, which is what gets passed to SSAS. This is an issue because, while we can usually get a dataset from the source system showing which regions or products should be secured for each user ID, that ID doesn't necessarily match what will be passed to SSAS. For example, my Windows ID may be “Manchester\LNani” but my user ID in the ERP system may be “LuisNani”.

Often the solution is to simply build a mapping table that maps the Windows Active Directory ID to the user ID in the source application or ERP. This is not hard, but some kind of interface needs to be created and some sort of maintenance process needs to be put in place for this to be a long term solution. Typically nobody is really excited about spending budget, time, and effort to build a one-off maintenance web application and then implement a process around it, but unless that is done the mapping table becomes neglected and outdated, which ultimately compromises the user-specific security in our SSAS cubes.

An alternate approach is to use SharePoint 2010 as the repository. A customized field can be added to all SharePoint user profiles (ERP_ID in my example below). The advantages are that there is now a place where the Windows Active Directory ID is mapped to the user's source application or ERP ID, and the organization can take advantage of the user profile maintenance pages to add, update, and delete the customized field. No new pages or applications need to be developed. Additionally, an organization can simply make populating the field part of the normal process of creating a new SharePoint user profile. In some organizations this process is pushed down to the SharePoint administrator at the business unit/department level and is not an IT responsibility.

To use the SharePoint user profile data, it needs to be extracted by a C# or VB.net program using the object model that Microsoft provides. I decided on a VB.net program mainly because it was more familiar within the organization.

The first step is to create the destination table in SQL Server with the DDL script below.

CREATE TABLE [dbo].[SharePointUserProfile](
      [FirstName] [varchar](100) NULL,
      [LastName] [varchar](100) NULL,
      [Title] [varchar](100) NULL,
      [WorkEmail] [varchar](100) NULL,
      [AccountName] [varchar](100) NULL,
      [ERP_ID] [varchar](100) NULL,
      [CreatedBy] [varchar](128) NULL,
      [CreatedDate] [datetime] NULL
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[SharePointUserProfile] ADD CONSTRAINT [DF_SharePointUserProfile_CreatedBy] DEFAULT (suser_sname()) FOR [CreatedBy]
GO

ALTER TABLE [dbo].[SharePointUserProfile] ADD CONSTRAINT [DF_SharePointUserProfile_CreatedDate] DEFAULT (getdate()) FOR [CreatedDate]
GO

Then a VB.net program can be executed on the SharePoint 2010 server to extract the user profile data and write the output to your staging database. The user profile data is then available to your ETL process and can be used in your data mart and your SSAS cube. The program as written accepts three parameters: the SharePoint 2010 site, the SQL Server, and the database name.

Imports System.Data.SqlClient
Imports Microsoft.SharePoint
Imports Microsoft.Office.Server
Imports Microsoft.Office.Server.UserProfiles

Module Module1

    Sub Main()
        ' Retrieves user profile properties from SharePoint 2010 and stages them in SQL Server.
        ' Expects three command-line arguments: site URL, database server, and database name.
        Dim strArgs() As String = Split(Command$, " ")
        Dim strSite As String = strArgs(0)
        Dim strDatabaseServer As String = strArgs(1)
        Dim strDatabaseName As String = strArgs(2)

        Dim site As SPSite = New SPSite(strSite)
        Dim serviceContext As SPServiceContext = SPServiceContext.GetContext(site)

        'Initialize the user profile manager for this service context
        Dim upm As New UserProfileManager(serviceContext)

        Dim con As New SqlConnection
        Dim cmd As New SqlCommand

        con.ConnectionString = "Data Source=" + strDatabaseServer + ";Initial Catalog=" + strDatabaseName + ";Trusted_Connection=yes"
        con.Open()
        cmd.Connection = con 'Sets the connection to use with the SQL command

        'Clear the staging table before reloading it
        cmd.CommandText = "TRUNCATE TABLE SharePointUserProfile"
        cmd.ExecuteNonQuery()

        'Parameterized INSERT so that values containing quotes do not break the statement
        cmd.CommandText = "INSERT INTO SharePointUserProfile " & _
                          "(FirstName, LastName, Title, WorkEmail, AccountName, ERP_ID) " & _
                          "VALUES (@FirstName, @LastName, @Title, @WorkEmail, @AccountName, @ERP_ID)"

        For Each p As UserProfile In upm

            'Get the profile property values; Convert.ToString returns an empty
            'string when a property has not been populated for the profile
            cmd.Parameters.Clear()
            cmd.Parameters.AddWithValue("@FirstName", Convert.ToString(p.Item("FirstName").Value))
            cmd.Parameters.AddWithValue("@LastName", Convert.ToString(p.Item("LastName").Value))
            cmd.Parameters.AddWithValue("@Title", Convert.ToString(p.Item("Title").Value))
            cmd.Parameters.AddWithValue("@WorkEmail", Convert.ToString(p.Item("WorkEmail").Value))
            cmd.Parameters.AddWithValue("@AccountName", Convert.ToString(p.Item("AccountName").Value))
            cmd.Parameters.AddWithValue("@ERP_ID", Convert.ToString(p.Item("ERP_ID").Value))

            cmd.ExecuteNonQuery() 'Insert one row per user profile

        Next

        con.Close() 'Close the connection

    End Sub

End Module
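Once the staging table is loaded, the ETL can join the SharePoint account name to the ERP security data to build the user-to-region bridge used by the cube's dimension security. A minimal sketch is below; the ERP table and column names are hypothetical, and only SharePointUserProfile comes from this post:

--Hypothetical bridge between Windows accounts and the regions each user may see
SELECT sp.AccountName  AS WindowsAccount,   --e.g. Manchester\LNani, as returned by the SSAS Username function
       sec.region_code AS RegionCode
FROM dbo.SharePointUserProfile AS sp
JOIN erp.user_region_security  AS sec       --hypothetical ERP security table
     ON sec.erp_user_id = sp.ERP_ID

This result set can then feed the cube's dynamic security (for example, through a hidden security measure group keyed by the account name SSAS receives from the Username function).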

Notes: 

There is another technique and set of objects that can be used to pull the data through a SharePoint web service. The advantage of this is that the code would not necessarily have to run on the SharePoint server, which may not be allowed in some environments.

Development using the SharePoint 2010 user profile object model is best done on a server with SharePoint 2010 installed.  This will ensure you have all the necessary DLLs available.

Saturday, February 5, 2011

Options for filtering parameter values in a SSRS drop down

I was working with a client who had a problem with their SSRS reports. Some reports had a parameter drop down with thousands of values. The parameter was for Item SKU, and this client had LOTS of parts for their products. Scrolling through the large drop down list was slow and cumbersome. The client wanted type-ahead/autocomplete functionality so the user could type in part or all of an Item SKU to filter the drop down list to a smaller set of values.

The three options I came up with to address this issue were:

Option 1: In SSRS 2008 users can now type the first few letters of their parameter value, which will position the highlighted value in the list. The problem is that this must be done EXTREMELY fast. If you are not very quick with the keyboard this will not work: the delay between the first keystroke and the second keystroke will cause SSRS to treat the second keystroke as a new first keystroke and therefore position the highlighted value on a different parameter value.

Advantages: Already available in the SSRS report viewers.
Disadvantage: Not usable for users who type slower than 240 words per minute.

Option 2: Add an optional parameter called “Item SKU filter” to pre-filter the Item SKU drop down. The default behavior when a user enters the report would be a blank Item SKU filter and the massive list of values for Item SKU. If the user enters a value in the Item SKU filter prompt, the string entered is used to filter what is presented in the Item SKU drop down box; if the user enters the entire Item SKU, then only one value will be presented in the drop down. The user still has to make a selection in the Item SKU drop down, which adds one more mouse click. This technique was described by Ella Maschiach in her blog.
Below shows the default behavior when the user enters the report: initially the Item SKU filter is blank and all Item SKUs are shown.


If the user enters a value in the Item SKU filter prompt, the available values for Item SKU are filtered. A user can type in a complete Item SKU to limit the list to one product.

The one difference from Ella Maschiach's example was that her example used SQL against a database, while my client was using MDX against a cube. The query that populated the available values for Item SKU had to be modified as shown below.

SELECT NON EMPTY { } ON COLUMNS,
IIF(LEN(@ItemSkuFilter) =0,
([Product].[Item SKU].[Item SKU].ALLMEMBERS),
(FILTER([Product].[Item SKU].ALLMEMBERS,
INSTR ([Product].[Item SKU].CURRENTMEMBER.MEMBER_CAPTION , @ItemSkuFilter) > 0)))
ON ROWS
FROM [ProductCube]

The trick was to replicate a SQL LIKE operator in MDX. Jason Thomas had an example of this posted here. I could have also used the LIKE function found in ASSP, the Analysis Services Stored Procedure Project, which can be found on CodePlex.
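For comparison, against a relational source (as in Ella Maschiach's example) the available-values dataset would just be plain SQL along these lines; the table and column names here are hypothetical:

--Available values for the Item SKU parameter, pre-filtered by @ItemSkuFilter;
--an empty filter returns the full list
SELECT DISTINCT item_sku
FROM dbo.DimProduct  --hypothetical product table
WHERE (@ItemSkuFilter = '' OR item_sku LIKE '%' + @ItemSkuFilter + '%')
ORDER BY item_sku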

Advantages: Easy to implement within SSRS. Each report that needs this can be modified in under an hour.
Disadvantage: Adds an additional mouse click if a user wants to enter a specific Item SKU, which is not really much of a time saver.

Option 3: Consider creating autocomplete functionality using AJAX/JavaScript. This would have to be implemented in a .Net application or a custom AJAX report viewer before the report is executed in SSRS. The application would have to handle populating all parameter values and then calling the report in SSRS. The upside is that the user experience for our massive drop down list can be greatly improved; this type of autocomplete is similar to what you see in Google or Bing search. Perhaps Microsoft can add this kind of functionality in a future SSRS release. Ben Scheirman had a blog posting on a solution.

Advantages: Best of breed approach
Disadvantages: Higher cost to implement.