One fundamental question that we as data warehouse and business intelligence professionals need to ask is: What goes in a warehouse, and what stays out? The question is about keeping the warehouse for intelligence and not for operational purposes. It is the constant struggle. Users need more functionality all the time, and many of today's off-the-shelf packages make it difficult to add functionality. This results in users requesting the warehouse to perform operational tasks. This is the road to failure. We must start to understand that if users need an operational reporting system, then build that. If they need a data warehouse and reporting system, then build that. One way or another, keep the operational and warehouse functions separate. Build data marts to address certain needs, but keep your warehouse as pure as possible.
So that's it for me for now. If you enjoyed these notes please let me know. Drop by and meet me at Oracle Open World or IOUG Live, I would love to exchange ideas with you all. We must keep learning and learn from each other; experience will help us all in forming our futures.
Posted by Ian Abramson
Improving your SQL
21 OCT 2004 05:05 EDT (09:05, GMT)
In our data warehouse and reporting environments, it is always a challenge to provide data quickly and efficiently. It always amazes me to see that many people run SQL that is written quite inefficiently. End users rarely take the time to determine the time necessary to run a query or even to evaluate the quality of a query with a simple execution plan (aka the explain plan). The quality of your SQL will impact the quality and timeliness of your results. So if you have queries that are running slowly, here are some methods you can use to improve overall performance in your database.
First things first; Oracle uses an optimizer called the Cost Based Optimizer (CBO). The CBO looks at statistics contained within the database to determine the most efficient manner to run a query. It decides if it should use an index or not and will determine the join methods that will be most efficient. As well, it looks into your hardware profile, which includes machine speed and disc speeds. All of this will help Oracle make a correct decision for your query. However, the biggest problem we see today is stale statistics. It is imperative that statistics on your tables are correct. To do this, you need to ensure that the DBA runs a package called DBMS_STATS that will update the statistics on a table, index or partition. Do not use the ANALYZE command if you are running a newer version of the database.
Next you need to learn how to read the results of the EXPLAIN PLAN command. These results tell you how Oracle will execute your SQL. Looks for items like full table scans and Cartesian products as these are usually some places where you can improve performance.
The final area that I will discuss is the SQL that you write. Oracle in version 9i and 10g has added many new SQL functions and features that will help data warehouse users. These include data windowing, rolling windows and models. These features provide us with the ability to produce complex reports while minimizing how much data is read from the database. Other functions like the median function will produce the results that were difficult to use in the past. Look at these features and see how they can help you. You need to remember that most reporting tools write generic-type SQL, and you sometimes will need to tweak this to improve performance.
There are many approaches that you can take with your SQL, but the biggest improvements can be made with simply reviewing your SQL before running it. Take the time to get things right.
Posted by Ian Abramson
The strike
20 OCT 2004 06:09 EDT (10:09, GMT)
OK, I have waited long enough for this topic. It is now time for me to discuss the NHL lockout. Since I am a Canadian, I love hockey. I also know that most other people in the world, Americans included; do not. Hockey is a secondary sport, ranked behind NASCAR, but it a sport whose fans are very passionate.
Lockouts and strikes have always been an issue; however, I feel that this one makes a lot of sense from my perspective. Other major sports have implemented salary caps, which have actually improved the competitive level of the game. Think of the NFL -- every team has a chance to win, as proven by my football selections each week. Now turn to major league baseball, a sport without a salary cap. I was a Montreal Expos fan from Day One, but as the team could no longer afford to compete and had to trade away their best players, I lost interest in them. My problem was: Why support them if they have no chance to win? (Their player budget was less than the salary of the middle infield for the Yankees.) It wasn't out fault; I think it was an effort on the part of MLB to destroy the game for Montrealers. Of course, if you are a Yankee or Red Sox fan, you don't care about the teams that can't compete with you. It is all it about winning. Who cares about the losers? I cared about my losers; I loved them and they will be missed.
So how do we make sure that our warehouse is on an even playing field? I feel that Oracle partitioning is the answer. Oracle allows you to partition data into separate logical and physical components by using data partitioning. This feature allows for significant performance gains, as only data that needs to be read is accessed. In warehouses that tend to have large data volumes, it does not make sense to read entire tables. Often you want to read data for a one-month period, and other times you want to read all of your data. In the first case, it does not make sense to read the entire table; with partitioning, you can simply read one partition and minimize the effort to get your answer. When building partitioned tables, you need to review the nature of the table. So, if you have a fact table that needs to be partitioned, you would look at range, list or composite partitioning, whereas you will find that most dimension tables will benefit most from using hash partitioning. If you are building any large tables in your Oracle database, I would suggest you use partitioning. Think of partitioning as your database salary cap, which allows all of your data to play on an equal footing.
Posted by Ian Abramson
Building with standards
19 OCT 2004 06:04 EDT (10:04, GMT)
So here we are, more than week into this wonderful experience on TechTarget. I have really enjoyed getting my thoughts about what I do every day down on "paper." Maybe I should write my memoirs. I always thought you needed to be important until I heard about a friend of mine who is writing his memoirs, but he is doing it to create a history of his time on this planet. I thought this was a great idea, but it seems like I should have started a long time ago.
Today, I was presented with a problem. As part of the next release, our warehouse was asked to create some functionality around a customer who cancels a contract. It turns out that another group has developed the same functionality that linked directly with the operational system. So the question became: Whose calculation was correct (or should I say, better)? To me it identified a real issue in business today: standards. We build warehouses to provide a standard approach and interpretation of business rules. We understand that at times you need to develop something to bridge a gap, but it is necessary to make the warehouse a central hub to your information needs. By ensuring that your company does implement standards, you will reduce the number of opinions that an organization can have. The data warehouse can help you to define these rules and ensure that the business finds the correct answer each time without having consistent results. In the end it was decided that the warehouse did not need to perform the calculation. This provided us with the advantage of having one place in the business where the calculation is performed, which will reduce maintenance and provide for a single right answer. You should keep this in mind when deciding if functionality is required in your data warehouse.
Posted by Ian Abramson
The Dynamic Duo: Speeding to the rescue
18 OCT 2004 21:24 EDT (01:24, GMT)
Here we are, a week into my experience on the Expert Answer Center. It has been a truly great experience, and I hope everyone who reads this is enjoying it, as well. Another weekend has passed; they always seem too short, and I still can't get enough sleep. Today I want to discuss some of the things that you can do to improve performance of your BI reporting. I want to thank my great friend and Cognos expert, Ted Falcon, who helped me with my blog today.
As we all know, the Lone Ranger had Tonto, the Green Hornet had Kato, Batman had Robin and data warehousing has BI. In all these examples, you can't have one without the other. These two entities will forever be intermingled and synonymous. DW/BI have been the "superheroes" of the business world.
Allowing businesses to make informed decisions by utilizing the data that they already possess quickly and efficiently. Quickly is the key word here. Speed is probably the most important factor in BI, apart from the obvious data integrity. If your data is correct in your warehouse, then the next thing you want out of your BI solution is speed of delivery. In order to achieve this, BI queries, cubes and ultimately reports must be optimized and fine-tuned to deliver the ever-important "numbers." This is where team work between the data warehouse developers and Cognos developers comes into play. When these two groups of people put their minds together and develop an efficient solution, the users will express their appreciation tenfold. However, if the solutions take too long and timelines are not met, well, then, that could make for a very long day at the office.
So what can we do to improve performance of our BI reports? Actually there is a lot you can do. Things such as proper indexing, summary tables and views can help a Cognos developer create fast Impromptu Reports and PowerPlay cubes. Indexing correctly can improve reports by allowing the query to address data in an efficient way. Look into bitmap indexes and index histograms; these indexing features can provide substantial gains for queries. Another option is the summary table. A couple of years ago when I worked with Ted, there was a cube that was taking 10 hours to build. It was not an over complex cube, but it was doing summarizations and other functions. I suggested we build a summary table to help this cube along. The daily construction of the summary table added 10 seconds to our load jobs, but more importantly it reduced the cube build to 10 minutes.
From a Cognos perspective, the developer can tweak the transactional queries in the cubes. To do this, perform an explain plan on your transactional query and see if there is a way to tweak your query to attain a lower cost and, therefore, cause your cube build time to decrease. They can separate their queries into smaller, more efficient, parts, as well. Splitting your cube construction into two query-types results in this improvement. There are dimensional and transactional queries. The transactional query is your basic data query, without joins to your dimensions, and, of course, the dimensional query retrieves the dimension data. The information is joined within Cognos and generally will result in improved performance.
Simple things like this can save you time and hail the praises from both your users and your data warehousing cohorts. As we all know, if the users are happy then the DW/BI team is happy, too. Wait a minute -- I think I hear finance approaching with another request; sounds like a job for DW and BI!
Posted by Ian Abramson
Oracle and the data warehouse
15 OCT 2004 05:55 EDT (09:55, GMT)
Today, I think we will talk about Oracle. Specifically we will discuss some data warehouse features. Recently my SELECT Journal from the IOUG (International Oracle User Group) arrived at my door. As usual it contained numerous articles that provided me with technical fodder to add to my Oracle toolkit. However, this past month there was an article authored by Jeff Maresh on the subject of star transformations. As you already know, Oracle is a very deep product when it comes to features that they embed in it. As data warehousing people, we already know the top features that every data warehouse will use, such as partitioning, analytic SQL, bitmap indexes and much more. One of the features that we like is the star transformation. With careful reparation you can see incredible increases in performance. By creating tables intelligently and knowing the query patterns of your users, you can optimize how a table is deployed and its performance at the same time.
Some things to think about when deploying fact and dimension tables:
- Base partition on the most logic separating column; this is generally a date column.
- Create bitmap indexes on individual columns.
- Define foreign key/primary key relationships, they do not have to be enabled, but the optimizer does need them defined.
- Use local indexes, or should I say minimize the use of global indexes, for maintenance reasons.
- Always use dbms_stats when analyzing a partitioned object.
- Build histograms for the optimizer.
So this then leads us to the star transformation. The star transform is the optimizer's way of making the most of a query, but reducing the amount of data that is needed to be reviewed and retrieved by the database. The star transform can give you huge gains in query performance and is a feature that you need to review for your own warehouse. There are numerous initialization parameters that you need to set, which include:
- Star_transformation_enabled
- Hash_join_enabled
- Sort_area_size
- Sort_area_retained_size
- Bitmap_merge_area_size
These parameters will all need to be tuned to achieve the full potential of star transformation. The bottom line is that you need to investigate how to best achieve performance gains for your own database, but this is one route that can help get you there.
Posted by Ian Abramson
Dimensionally speaking
14 OCT 2004 05:52 EDT (09:52, GMT)
Last night was pretty boring. I got a new refrigerator. I guess that could be considered fun, but I figure I should just get into today's blog. Today I would like to discuss data warehouse dimension tables and how to simplify your data access and build a flexible and extendable solution. Sounds just like yoga!
As we all know, every data warehouse contains dimensions, and every BI solution has some dimension to its solution, so dimensions are important. It is dimensions that empower your data in your warehouse. Dimensions are your windows to your data. They form the basis of most constraints that we use in our queries and also allow for the categorization of data as you utilize them in reporting. So dimensions are critical to your warehouse and business intelligence success.
Dimensions come in all sizes and shapes. A dimension of customers can be large with many attributes. It is these attributes that truly empower your warehouse. So when designing a dimension, try to add columns that will be useful to your users. For example, we have an address dimension in our warehouse and we include the postal code (a Canadian zip code); we also include the FSA (first three digits of the postal code, which is used for mail sorting) as a field. Although the data is redundant, it is used on a regular basis. So instead of forcing users to pull out the first three characters each time, we do it for them when we populate the table. This provides the performance and flexibility that the users needed.
Dimensions should always use a surrogate key that is generated within the warehouse. I went to a presentation a couple of years ago by Ralph Kimball (a data warehouse author), and he discussed the importance of removing the warehouse's dependency on business keys. The idea is a good one, because business keys change regularly and this will result in a long-term problem for the warehouse. However, when we discussed Slowly Changing Dimensions (especially ones that kept history), he said that we should use the business key to link them together. This went against what he had just said, so I decided that we needed to find another solution.
The solution that we came up with was a double key. One key for the dimension that uniquely identified the record, and a second one that linked together the records that are grouped. You can think of this as a generated business key. We also added a current record indicator. There is an entire presentation just on this topic, but to summarize, the creation of this double key provided us with so much that it far exceeded our expectations. Now we could report on data based on dimensions at the present time, as well as reporting on data as it was when the record was originally created. So always build for practicality, not for art.
Dimensions in your BI reporting facilities should be deep. When building your BI reports you need to ensure that users have the ability to drill into dimensions and drill across to others. The dimension is really the best way to empower your reporting. If it is good to report by state, it is better to report by city or maybe by store. By building your BI dimensional maps to support this, you will ensure BI success.
Bottom-line: Dimensions are critical to your success and your warehouse's success.
Posted by Ian Abramson
Make sure your data warehouse is really flexible
13 OCT 2004 12:16 EDT (16:16, GMT)
Wow, here we are at Day 3 already! Tuesday is the day I do yoga. Believe it or not, I admit it: I go to yoga class, mainly because he it helps me with hockey. Yoga helps you with your stability and your flexibility -- both good things for a hockey player. Of course, this connects to my thought today: Build your information factory with flexibility built in from the start.
Data warehouses and business intelligence are information systems that require the ability to adapt to the changing needs of the business. We build warehouses to satisfy business analysis needs, and these needs are changing constantly. I find that the best way to improve a warehouse is to build BI systems. Business intelligence is the gateway to the data and can form the first line of data access that many organizations. So how do we build flexibility? Let's look at a few guidelines and see how they can help build a solution that will solve short-term data needs while not limiting us in the future:
- Beware of your grain
- Don't build to specific reports
- Keep dimensions simple
Let's now look at each of these guidelines.
The grain of a data warehouse is the easiest place to begin when looking for a flexible future. When building a warehouse, we always define the grain or level of summarization that the data will be stored with. So we can store individual sales transactions or we could store sales by store by month by product. Just by looking at these two alternatives, we can start to see that by storing low-level transactions, we will be able to build on that to any level of grain. However, with the second we may not be able to respond to all BI needs due to the fact that data is summarized. The lesson here is to not over-summarize your data.
Next, we should not build to specific reports. I find that too often we build warehouses and warehouse components to satisfy specific reports instead of information needs. Often tables in a warehouse are built to provide data to one report, in one way and with no flexibility. This is not good. I remember a data warehouse I helped to design for the government. The user wanted us to recreate a report that was built over 20 years ago. They had run this report for 20 years and did not see any reason that the report today would not suffice. After all, if it has worked for the past 20 years, why change things? Of course, we created a flexible design that would satisfy the reporting requirements but made a quantum leap in terms of extending their analytical abilities. We presented our solution and were told by the user that it was too complicated and they wanted a table that matched the report format. We explained the advantages of our solution and the long-term benefits, but they did not understand why anyone would build for the future when the past worked so well. As they say, "You can lead a horse to water..."
The final point for today is the idea of keeping your dimensions simple. Keep the grain of the dimension basic (no summarization). Avoid recursive relationships; they tend to complicate reporting and always remove your dependency on business keys and utilize data warehouse (surrogate) keys.
I hope these points inspire you to build for the future of your warehouse. A warehouse is a design built in Jello. It will constantly shift and change and as such you need a design that can jiggle into each new requirement as they emerge over time. So build with the future in view when you build your warehouse.
Posted by Ian Abramson
Life as a goalie
12 OCT 2004 05:08 EDT (09:08, GMT)
Day 2. Last night was hockey night in Canada. I play ice hockey, and it is as big a part of my life as are data warehouses. There is nothing better than getting on the ice and feeling the cold and getting ready to take on your opponent. I am the goalie, which is appropriate for my personality. I like the idea of being the player who is sometimes the hero and other times the dog. Goalies win games, and they also lose games. This is much like a data warehouse. Data warehouses are easy targets, just like goalies, for placing blame when data goes bad. So I was wondering how many people consider data quality when sharing information.
I have worked on a number of data warehouses, and each one had its challenges of data quality. Some companies don't know who there customers are, while others can't figure out how many products they sold. However, it is the data warehouse that becomes the central figure in the quest for quality. Often companies "fudge" their numbers when performing reporting; they do this because spreadsheets are easy to change to give you the information you want. However, when you bring in data warehousing and business intelligence, you remove this ability. One of the basic ideas behind a data warehouse and BI is that results are auditable and repeatable; however, data quality often affects this ability, resulting in reports that are misleading. So how can we solve these problems?
I think the data warehouse should become the corporate conduit of change. A data warehouse is very good at illustrating how operational systems lack the required diligence to ensure quality. The warehouse is a receiver of problems, not usually the creator. We need to use the warehouse to find operational issues, but the business must be willing and able to affect change for the overall good of the company.
So by creating processes that provide the following abilities, you should be better able to handle and manage data quality:
- Assign a person to be the corporate information officer. It will be their responsibility to manage data quality and ensure that data reflects reality.
- Build in validations and auditing to ensure that the data that comes from your information sources can then be traced to your warehouse.
- Create reporting facilities that focus on quality reporting. These reports are needed to balance the warehouse to the operational sources.
- Use tools that help you to improve data quality. These include extraction, transformation and loading (ETL) tools, as well as data matching tools. Creating an easily manageable environment will improve the ability to respond to quality issues.
- Communicate quality and lack of quality to the user community. They need to understand their data and their limitations.
The bottom line is that quality is job one, and data warehouses need to give their information users in a company a true sense of confidence. Goalies provide a team support and allow them to excel at their own jobs. The warehouse will form your information goalie, just like I do for my hockey team, and allow your business to get on with business.
Posted by Ian Abramson
Data warehousing and business intelligence: What a pair!
11 OCT 2004 06:00 EDT (10:00, GMT)
So here I am with my very first blog. Over the next two weeks I will provide everyone who reads this with some help in building, maintaining and optimizing his or her data warehouse and business intelligence environments.
I started this blog and wondered: Where did the word blog come from? Turns out that its origins are in the word Weblog. This, then, became the wee-blog and since, as with many things in computers, it was soon shortened to blog. So it's a log of my life in warehousing.
We should start at the beginning. What is a data warehouse? And what is business intelligence? These are two fundamental questions that I will try to answer in a quick and concise way. A data warehouse is a collection of information from various sources in a central repository. This repository is then available to an organization for use in reporting and analysis. Check out the more detailed definition here.
So then what is business intelligence (BI)? So many people don't see any difference; some people think that BI and data warehousing are the same. Of course, they are different; they complement each other. BI is the collective science of collecting, analyzing and presenting information to a business. It does not require a data warehouse, since it can collect data from operational -- as well as offline -- sources. Here are some more details for you.
As it turns out, BI and data warehousing do live well together, and they are better together then separate. I see BI people as the rock stars of knowledge management. They create reports and cubes and all sorts of analytical results to allow businesses to optimize their operations. It is these reports that management sees and for which BI programmers get the credit (sometimes not), and this makes them the stars. However, it is the data warehouse team that keeps everything rolling; they are Information Roadies. The Information Roadies are the backbone of a successful enterprise information strategy. So bring on your questions in the next few days, and I hope that I can enhance your data information experience.
Posted by Ian Abramson