SQL
Making the switch
29 JUL 2005 21:08 EDT (01:08, GMT)
Well, the last two weeks have been fun. I remain unconvinced that a daily SQL blog offers the same connection to the zeitgeist as normal blogs (normal blogs being, of course, about the author and the author's experiences, feelings and thoughts). On the other hand, a good article is timeless, and I shall continue to write about SQL here at TechTarget and elsewhere, as the opportunities arise.
The number of questions submitted was lower than I expected. Maybe this is just because it's summer. Plus, people don't usually have SQL questions until they have them, and maybe, by chance, few people had any, these past two weeks.
If later you come across these blog entries, and say to yourself, "Self, this guy looks like he might know the answer to my SQL question," then please feel free to send it in. The TechTarget folks will get it to me somehow. You may also contact me through my Web site, r937.com.
To close off my two weeks at the Expert Answer Center, today I'd like to talk about making the switch to SQL. Here's some sample data:
a1 b1 c1 d1
a1 b1 c3 d7
a1 b1 c5 d9
a1 b2 c1 d1
a1 b2 c2 d2
a1 b2 c4 d4
a2 b1 c1 d1
a2 b1 c1 d2
a2 b1 c1 d3
a2 b1 c1 d4
a2 b1 c1 d5
Here's the situation described by the person who submitted this problem to a database discussion forum the other day. The query:
SELECT DISTINCT a,b,c,d FROM x
produces distinct values of a,b,c,d. But the person wanted only distinct a,b (no matter what c,d are).
If you were processing these rows yourself, you'd first sort the entire table into a,b sequence (as they are in the sample data shown above), and then start reading rows. The first row for each a,b combination, you'd keep. Then you'd discard any further rows with the same a,b value. When you hit a new combination of a,b, keep that row and carry on like that until you get to the end of the table (and then, as the king in Alice in Wonderland says, stop).
The problem is that SQL doesn't work like that.
A suitable solution is this:
SELECT a,b,max(c),max(d) FROM x
GROUP BY a,b
Because of GROUP BY a,b, we know we are going to get each combination of a,b only once. We use aggregate functions on c and d in order to provide a single value of each of them to the grouping operation.
The trick here is not to be surprised that max(c) and max(d) might not actually have come from the same row. So for that combination of a,b (which is present only once in the result set of the query because of the GROUP BY), the values in all columns in a given result row might not actually be a row in the table.
That's the switch I mentioned. You see, you actually can write a procedure that walks through the sorted table and discards rows as described before (using a cursor and procedural SQL). But why would you? It's more difficult, takes way more code and -- here's the strongest reason not to do it -- it's as slow as cold glue.
Make the switch to thinking in terms of sets. The sets here are the groups of rows of the table, grouped by a,b. You may use any of the columns in the table, but only in aggregate functions. There's a good chance you can get what you need very simply.
Posted by Rudy Limeback
Trials, tribulations and pleasant surprises with vendor-specific SQL
28 JUL 2005 15:19 EDT (19:19, GMT)
If you've ever worked with more than one database system, such as Oracle, DB2, SQL Server, MySQL and so on, you will immediately have noticed that there are differences in the SQL that you can use in each of them. The first time this happens, your reaction is probably "Hey, wait a second, isn't SQL supposed to be a standard language? What's up with this?"
For example, in standard SQL, you concatenate two strings like this:
string1 || string2
Oracle supports this syntax, but in SQL Server, you must say:
string1 + string2
In Access, depending on how you want NULLs handled, you may say either:
string1 + string2
or
string1 & string2
In MyQSL, you must say:
concat(string1, string2)
And that's just scratching the surface. There are even more differences waiting for us in handling dates. Suppose we need to select rows from a table with a WHERE condition on a DATE_UPDATED column, such that the date occurred within the previous 10 days. You won't believe the different shenanigans we have to employ to do this efficiently in each database system.
Is there anything we can do to cope with this problem? I once learned an old saying, which goes something like this:
There is no such thing as a problem. Either it has a solution, in which case it isn't really a problem, it's an opportunity waiting for you to apply the solution, or else it has no solution, in which case it isn't really a problem either, it's a fact of life.
It's a fact of life that different database systems use different "dialects" of the same language. We simply learn to deal with this. If we forget which syntax to use in a particular scenario and accidentally use the wrong one, we'll usually get a syntax error. So off we go to Read The Fine Manual, and after a while, we remember the different syntaxes. And let's not overstate the issue, the various dialects have a lot more in common than they have differences.
Of course, we also always do a sanity check on our query results, just in case the syntax actually runs (technically it "works"), but returns the wrong result. I helped someone the other day who couldn't understand why his query returned all the rows in the table, when clearly his WHERE condition for the DATE_UPDATED column used the correct syntax:
where date_updated >
date_add(current_date, interval -10 day)
Upon further investigation, it turned out that the DATE_UPDATED column values looked like 1118811600, 1120194000 and 1122094800. The column had been defined as INTEGER, and the application was storing Unix timestamp datetime values. The database in this case was MySQL, and it is hard to criticize MySQL for allowing a comparison between an integer and a date -- it just converted the date to an integer, 20050725, which of course is less than any Unix timestamp.
In fact MySQL, which seems to have a very bad reputation amongst people more familiar with other so-called "real" relational databases, has a few tricks up its SQL sleeve that other databases are lacking.
Here's a great example. Suppose we have a many-to-many relationship in our database, where each thing can belong to multiple categories. Perhaps the things in our database are people and the categories are personalities, and we want to categorize people by adjectives like "happy," "kind," "helpful," "crabby" and so on. There would be a people table, and, since we wouldn't dream of building a table that had a list of words inside a single column (this would violate first normal form), there would be an adjectives table and a many-to-many relationship table to categorize people by adjectives.
Now suppose that, despite having stored our data correctly, we wanted to produce a report that listed each person once, along with a list of the adjectives that describe that person:
person adjectives
curly happy, simple, silly
larry quiet, shy
moe crabby, bossy, mean
In MySQL, this is a piece of cake:
select p.name
, group_concat(a.adjective) as adjectives
from people as p
inner
join person_adjective as pa
on p.id = pa.person_id
inner
join adjectives as a
on pa.adjective_id = a.id
group
by p.name
The MySQL GROUP_CONCAT function is an aggregate function and produces a delimiter-separated list from a column of values. Sybase has something similar, the LIST function. But look at how difficult this is to pull off in Oracle:
Function to concatenate output.
Moral of the story? Yes, each database system is different, and sometimes wildly so. The differences aren't always all bad, either.
Posted by Rudy Limeback
The difference between WHERE and HAVING
27 JUL 2005 23:51 EDT (03:51, GMT)
Ever wondered about the difference between WHERE and HAVING? Today we'll look at three simple queries based on the following Transactions table:
| name | type | amount |
| foo | 1 | 42 |
| foo | 3 | 12 |
| foo | 4 | 35 |
| foo | 5 | 63 |
| foo | 8 | 12 |
| bar | 3 | 24 |
| bar | 5 | 13 |
| bar | 6 | 72 |
| bar | 8 | 62 |
| qux | 0 | 65 |
| qux | 3 | 32 |
| qux | 5 | 84 |
| fap | 2 | 11 |
| fap | 8 | 12 |
| fap | 9 | 37 |
Here now are three very simple queries. The first features a WHERE condition:
select name
, type
, amount
from myTable
where amount > 34
This query selects only those rows that satisfy the WHERE condition,
where the value of the amount column on that row is greater than 34. Now consider the second query:
select name
, sum(amount) as total_amount
from myTable
group
by name
having sum(amount) > 58
Nothing really new here. This query selects all rows, groups them together by the value of the name column, calculates group sums and evaluates the HAVING condition to make sure that the sum for each group (name) is greater than 58.
Now consider our third query:
select name
, sum(amount) as total_amount
from myTable
where amount > 34
group
by name
having sum(amount) > 58
This is also a normal query, even if it's somewhat more complex than you might have seen before. The rows are first filtered by the WHERE condition, so that only those rows where the amount is greater than 34 are passed to the grouping operation. Then the groups are formed, and the sum for each name is calculated. Finally, the HAVING condition is evaluated to make sure the sum for each group is greater than 58.
But now let's consider this:
select name
, type
, sum(amount) as total_amount
from myTable
group
by name
, type
having type > 5
Does this work? Yes. Is it efficient? No. What happens is that all the groups are formed, when they don't have to be.
Whenever it needs to do a GROUP BY, the database engine has to create a temporary or intermediate table, where it can sort the rows that are being grouped. This internal sort will take a lot longer if all the rows of the table are present. The rows are collected and sorted, and then the groups are formed and the sum calculated. And then the results are thrown away if the value of the type column is not greater than 5.
Instead, that last query should be written like this:
select name
, type
, sum(amount) as total_amount
from myTable
where type > 5
group
by name
, type
Now fewer rows go into the grouping operation, with the result that
the query is more efficient.
What's the secret to knowing whether to use WHERE or HAVING? The HAVING condition should involve only aggregate expressions.
Posted by Rudy Limeback
Queries across different database systems
26 JUL 2005 15:33 EDT (19:33, GMT)
If you were not aware, I also appear regularly in an Ask the Expert series on another of TechTarget's sites. There are actually many of these series, and although mine resides on the SearchOracle.com site, it's about SQL -- not Oracle SQL, just SQL -- and the concepts within SQL that you can apply to any database system.
One of yesterday's answers, Why I love VIEWs, is probably one of my best efforts (at least by my criteria, and I can be harsh on myself).
The idea I tried to illustrate is that you can use a view in a query such as this:
select t1.agent_id
, t1.client_id
, t1.total_prem
from client_totals as t1
left outer
join client_totals as t2
on t1.agent_id
= t2.agent_id
and t1.total_prem
< t2.total_prem
group
by t1.agent_id
, t1.client_id
, t1.total_prem
having count(*) < 25
order
by t1.agent_id
, t1.total_prem desc
In case you missed it, the view is called client_totals. In most database systems, the view is created like this:
create
view client_totals
as
select agent_id
, client_id
, sum(premium) as total_
from policies
group
by agent_id
, client_id
As I mentioned in my postscript, you can use this feature in Microsoft Access by simply saving the SELECT statement in the view as a query with name client_totals.
This is another reason -- there are many -- why I love Microsoft Access. It has great support for views. In Access, the big query would be a saved query, and the view would be a saved query (minus the CREATE VIEW ... AS syntax, just the SELECT). By changing the column mentioned in the SUM aggregate function in the view, you change which top 25 clients per agent the big query produces -- by total premium or by number of claims or by...
What?! I like Access?
Yes, and I've admitted it before. However, I usually try not to mention it. On occasion, my remarks about how some data manipulation or another is a joy to perform in Access have been interpreted by one or more well-meaning but occasionally mean-spirited individuals as an opportunity to educate me as to why Access couldn't possibly be a "real" database. To spare them and me both the wasted effort such a conversation would engender, I now usually just say "You most likely are completely right. I shall continue to use and love Access anyway."
Besides the ability to implement views, another feature I like about Access is the ability to link to external data. Using ODBC drivers, you can link to tables in Oracle, SQL Server, other Access databases, even CSV files. You can even use these external tables in JOINs. Normally, queries that you execute within Access are handled by what's called the Jet engine. In the case of a query which references an external table, Jet will first extract from the query's WHERE clause any filter conditions pertaining to the external table, use those conditions to build a simple SELECT statement against just that external table, and then send that request off via ODBC. When the results come back, Jet simply treats them as another table. You can even do a JOIN between two different external tables, such as one from Oracle and the other from SQL Server. The filtered rows from each external table are pulled into Access, and the Jet engine performs the join.
I love how easy that is.
Let's face it, not everything we as data professionals need to do will always afford us the opportunity to implement a proper solution consistent with the corporate three-tiered architecture (I'm not sure I even know what that means). If the corporation has customer details on Oracle and sales transaction details on SQL Server, and the vice president of marketing wants a summary report in her inbox in "softcopy" format (i.e., an Excel report) by noon, and you are the person selected to produce it for her, then you will be happy to know that Access offers a really quick and easy way to do it.
Posted by Rudy Limeback
Looping in SQL
25 JUL 2005 14:56 EDT (18:56, GMT)
Today, I have a challenge for all you data professionals out there.Suppose you were given the following data file: | type | name | date | qty |
|---|
| apple | mcintosh | 2005-06-14 | 3 | | apple | mcintosh | 2005-06-27 | 2 | | apple | mcintosh | 2005-07-03 | 4 | | apple | spy | 2005-07-01 | 1 | | apple | spy | 2005-07-05 | 2 | | apple | royal gala | 2005-06-05 | 6 | | apple | royal gala | 2005-06-06 | 3 | | orange | navel | 2005-06-15 | 3 | | orange | navel | 2005-06-17 | 3 | | orange | navel | 2005-06-18 | 3 | | orange | valencia | 2005-06-20 | 1 | | orange | valencia | 2005-07-20 | 1 |
How would you go about taking this data file and producing the following report: Would you... load the data file into a database table? normalize the data into fruit, fruit type and fruit transaction tables? use an OUTER JOIN for each of the months? Notice the zeroes on the report, where there are no zeroes in the data file use a GROUP BY? On which columns? use a cursor in a stored procedure, toaccumulate the quantities per month as you traverse thedata file, printing totals as you go? some of the above? none of the above?
The correct answer is 7, if you have reporting software. The wholepoint of any decent reporting software product worthy of the name is that it can take a data file like the above and produce "summary" reports like the above without breaking a sweat. But what if you needed to do this in SQL? More specifically, whatif you have already invested some effort into producing the abovedata file yourself with other queries? For example, suppose that the above data file for apples and oranges actually came from two disparate sources, such as different corporate databases, and is actually the result of a UNION query, which itself already uses GROUP BY clauses to achieve total quantities per day. How do you take a query result and produce a formatted report like the above? As a hint, consider the title of today's entry: Looping in SQL.What do you think I meant by that? Please send me your comments, questions, or suggestions about this problem. I'll write up the best submissions (I won't use your name unless you want me to), and then in a day or so, I'll give you my answer.
Posted by Rudy Limeback
COALESCE and NULLs
22 JUL 2005 22:19 EDT (02:19, GMT)
Have you ever asked yourself, "Are NULLs OK to use? What are NULLs, anyway?" One of the best articles I've seen on NULLs is 4 simple rules for handling SQL NULLs. If you aren't sure what the issues are with NULLs, please go and read that article now. I'll still be here when you get back.NULLs are scary, right? They do not have to be. Just follow the author's four rules (excerpted here): - Use NULLs to indicate unknown/missing information only.
- NULL is not equal to anything, even other NULLs!
- Always use ANSI standard SQL syntax for NULLs.
- ANSI standard COALESCE() and CASE syntaxes are preferred over proprietary syntax.
Actually, I need to clarify, or perhaps disagree with, the first of those rules. The rule itself is correct: Use NULLs to indicate unknown/missing information only. But the author then says "Do not use NULLs in place of zeroes, zero-length strings or other 'known' blank values."The problem I have is the whole concept of a "known blank value." In the article, a distinction is made between NULL, meaning unknown or missing, and an empty string, meaning "known not to have a value." Apparently a NULL middle name is completely different from an empty middle name. An empty middle name is supposed to indicate that the person doesn't actually have a middle name. A NULL middle name would indicate that we don't know whether the person has a middle name or not. I'm sorry, but how's the actual end user of the data supposed to discern the blank spaces on her printed report and know whether they were produced by a NULL or an empty string? I've seen too many real world databases with a mix of NULLS, empty strings and even actual values such as "Unknown" or "N/A" thrown into the mix. Zeroes, zero-length strings or other "known blank values" can be worse than NULLs, because of the additional complexity they introduce. It's almost like the "known blank values" introduce a fourth type of value in three-valued logic! For example, let's say you have a customer table, into which you place information like name, address, postal code and so on. Now consider the lowly apartment or suite number. In Canada, our postal regulations require that the apartment or suite number should preface the street number on the first address line. This means that if you live at 123 Sesame Street, Apt. 6, then your address would be 6-123 Sesame Street. Despite this obviously intelligent method of dealing with apartment or suite numbers, many computerized customer applications, whether they are purchased software products or applications developed in-house, carry a separate apartment or suite column. This being the case, what should you place into the apartment or suite column if the customer doesn't have one? And what should you place into the column if you don't know whether the customer has an apartment or suite number or not? I would like to suggest that it doesn't make a difference. So use NULL. Many database people I've encountered over the years will routinely declare all database columns like this: aptno varchar(12) not null default '' This may seem innocuous at first, but there's a subtle problem. Consider the query you have to write in order to count how many customers live in an apartment or suite:select count(aptno) from customers where aptno > '' However, change the column definition to this:aptno varchar(12) null Then your query becomes this:select count(aptno) from customers Doesn't that make a lot more sense, as well as being simpler?In yesterday's query, we saw a FULL OUTER JOIN with a liberal number of COALESCE functions. COALESCE is used to test for NULLs. COALESCE returns the first non-null value in its list of arguments. In particular, consider this: coalesce(L.Totals,0) as Totals1 Here, the column value from the left table is tested, and if it's NULL, zero is returned instead. Why would it be NULL? Because it's an outer join. Outer joins return NULLs. Even if you declare NOT NULL for every single column in your entire database, you will still encounter NULLs. So it only makes sense to learn to deal with them.What about this: coalesce(L.KeyCol1,R.KeyCol1) What if both L.KeyCol1 and R.KeyCol1 are NULL? In that case, COALESCE would return NULL. However, they won't both be NULL at the same time, because they are both keys, which, as you know, are always declared NOT NULL. Whoa, wait a second. But if keys are always NOT NULL, why do we have to use COALESCE?Because it's an outer join, and that's how they work. In an outer join, some of the information is missing. In its place, you find NULLs. The simplest way to deal with them is to use the COALESCE function (or the equally simple IS [NOT] NULL syntax). And once you can handle NULLs in outer joins, declaring data columns to be NULL instead of NOT NULL won't be quite as scary.
Posted by Rudy Limeback
Full outer join
21 JUL 2005 18:08 EDT (22:08, GMT)
Yesterday's query featured a GROUP BY in an outer query, with a UNION as a nested query or derived table. Today we are going to produce the same results using a full outer join. Perhaps it would be a good idea to summarize quickly the main features of an outer join. If you already know what an outer join is, skip ahead a few paragraphs. The easiest way to understand outer joins is by example. Here are two tables, fruits and colors: | fruits |
|---|
| apple | | banana | | cherry | | orange |
| | | colors |
|---|
| red | | yellow | | orange | | blue | | green |
|
Okay, let's first do an inner join: select F.fruit , C.color from fruits as F inner join colors as C on F.fruit = C.color
Simple, eh? No, you do not need to have a numerical key to join two tables. No, you do not need to have primary or foreign keys to join two tables. You can join two tables on whatever columns you want -- you might not get any results, or you might get too many, but you are allowed to join on whatever columns make sense. Here we are joining fruits to colors by matching on name. Next, let's do a left outer join: select F.fruit , C.color from fruits as F left outer join colors as C on F.fruit = C.color | fruit | color |
|---|
| apple | NULL | | banana | NULL | | cherry | NULL | | orange | orange |
A left outer join means that you will get all rows of the left table, with or without any matching rows from the right table. In this query, fruits is the left table, because it's on the left side of fruits left outer join colors. If there is no matching row from the right table, any columns from the right table in the SELECT are set to NULL. A right outer join is the same as a left, except it works the other way around: select F.fruit , C.color from fruits as F right outer join colors as C on F.fruit = C.color | fruit | color |
|---|
| NULL | red | | NULL | yellow | | orange | orange | | NULL | blue | | NULL | green |
A right outer join means that you will get all rows of the right table, with or without any matching rows from the left table. In this query, color is the right table, because it's on the right side of fruits right outer join colors. If there is no matching row from the left table, any columns from the left table in the SELECT are set to NULL. Ready for a full outer join? select F.fruit , C.color from fruits as F full outer join colors as C on F.fruit = C.color | fruit | color |
|---|
| apple | NULL | | banana | NULL | | cherry | NULL | | orange | orange | | NULL | red | | NULL | yellow | | NULL | blue | | NULL | green |
As you can see, a full outer join is sort of like a left outer join and a right outer join combined. Rows that match are returned, and so are rows that don't match, in either direction. Now, let's take this idea and rewrite yesterday's query: select coalesce(L.KeyCol1,R.KeyCol1) as KeyCol1 , coalesce(L.KeyCol2,R.KeyCol2) as KeyCol2 , coalesce(L.KeyCol3,R.KeyCol3) as KeyCol3 , coalesce(L.Totals,0) as Totals1 , coalesce(R.Totals,0) as Totals2 from Table1 as L full outer join Table2 as R on L.KeyCol1 = R.KeyCol1 and L.KeyCol2 = R.KeyCol2 and L.KeyCol3 = R.KeyCol3
Ah, 'tis a thing of beauty, eh? But what, you may ask, is this COALESCE function all about? Why, it takes care of the NULLS!
Posted by Rudy Limeback
Data from both tables
20 JUL 2005 13:10 EDT (17:10, GMT)
This morning a co-worker asked me for help with an interesting problem. She had developed a query to combine results from two separate tables. These two tables had identical structures as far as the key columns were concerned.Table1 KeyCol1 varchar(12) KeyCol2 integer KeyCol3 varchar(20) Totals integerTable2 KeyCol1 varchar(12) KeyCol2 integer KeyCol3 varchar(20) Totals integer The key columns are the same in both tables. The columns named Totals contain data for each combination of key column values. Many sets of key column values match across both tables, but there are also some keys in Table1 that aren't in Table2, and some keys in Table2 aren't in Table1. I'll come back to this later.Here's the query to combine the data in the two tables: select KeyCol1 , KeyCol2 , KeyCol3 , Totals as Totals1 , 0 as Totals2 from Table1 union select KeyCol1 , KeyCol2 , KeyCol3 , 0 as Totals1 , Totals as Totals2 from Table2 It was at this point that my co-worker was stuck. This UNION query was producing the right data, but it looked like this:KeyCol1 KeyCol2 KeyCol3 Totals1 Totals2 curly 23 nyuk 9 0 curly 23 nyuk 0 37 larry 15 oowwwww 2 0 moe 21 why you 4 0 moe 21 why you 0 4 shemp 8 eebeebeeb 0 11 What she wanted was to make it look like this:KeyCol1 KeyCol2 KeyCol3 Totals1 Totals2 curly 23 nyuk 9 37 larry 15 oowwwww 2 0 moe 21 why you 4 4 shemp 8 eebeebeeb 0 11 Here's what I suggested: Take the UNION query, and think of it as returning a table. Now, just wrap an outer query around the UNION query, specifically by replacing the table name that you normally have in the FROM clause with the UNION query in parentheses:select KeyCol1 , KeyCol2 , KeyCol3 , sum(Totals1) as sum1totals , sum(Totals2) as sum2totals from ( select KeyCol1 , KeyCol2 , KeyCol3 , Totals as Totals1 , 0 as Totals2 from Table1 union all select KeyCol1 , KeyCol2 , KeyCol3 , 0 as Totals1 , Totals as Totals2 from Table2 ) as dt group by KeyCol1 , KeyCol2 , KeyCol3 Notice that I gave this "nested query" an alias name of dt. This stands for derived table. You could give it any name you wish, and you might not even need to declare an alias, depending on your database system. I know of at least one database system that always requires it, so I always just code as dt.So this produced the desired result, and she was happy. I'd like to make two comments about this query. The first is that in the nested query I coded UNION ALL instead of UNION. The reason for this is to avoid a totally unnecessary sort. Let me explain. If you say UNION, instead of UNION ALL, you are asking the database to return distinct rows. The way it does this is by combining the rows that are produced by the two SELECTS that make up the UNION query, and then sorting the combined set of rows on all columns. Then it scans the sorted rows and examines them two at a time, going through the entire result set like that, throwing out duplicate rows. This behavior could either be desirable or totally the opposite. Consider the query to find the customers who have an account with us. select customer from savings_accounts union select customer from current_accounts Let's say you are just looking for a list of customers. The fact that it's UNION means you will not see any duplicate customers. Since this is exactly what you want, you say UNION.But what if you wanted a total account balance for each customer? Well, you could do what we did above, use SUM and GROUP BY in an outer query with the UNION query as a nested query. select customer , sum(acct_balance) as total_balance from ( select customer, acct_balance from savings_accounts union select customer, acct_balance from current_accounts ) as dt group by customer This works fine, except for one thing. Suppose there's a customer with exactly $937.00 in both accounts. Your total will be wrong. This "whoopsie" is a direct result of UNION removing duplicate rows. In this case, this is not what you want, so you have to say UNION ALL. The way I remember the difference is by the fact that ALL means all rows (i.e., don't remove any, even if they are duplicates).Wait a second (I can hear some of you thinking). In the original UNION query we looked at, there couldn't possibly be any duplicate rows produced by the two SELECTs. One of them returns rows with data in Totals1 and 0 in Totals2, while the other one returns rows with 0 in Totals1 and data in Totals2. So how could there be any duplicate rows for UNION to try to remove? That is correct, and that is why you say UNION ALL -- to avoid the unnecessary sort and scan for duplicates, of which there aren't any anyway. The second thing I wanted to say about my solution of using a GROUP BY outer query on a derived table produced by a nested UNION query, is that this isn't the only way to solve the problem. We could also do it with a FULL OUTER JOIN. Remember earlier when I said that many sets of key column values match across both Table1 and Table2, but there are also some keys in Table1 that aren't in Table2, and some keys in Table2 that aren't in Table1. Using a FULL OUTER JOIN, you can get not only sets of rows that match on key values, but also those unmatched rows from either table that don't have a matching key value in the other. I'll explain this in more detail tomorrow.
Posted by Rudy Limeback
Good questions to ask me
19 JUL 2005 00:52 EDT (04:52, GMT)
Yesterday's entry was not posted on the site all day (I can't post it myself, unfortunately), so you missed my call for questions. Meanwhile, five more questions came in. So that's really good news. I love questions, so please keep them coming. The bad news is, four of the people who submitted new questions aren't going to get the answers they expect. I'll deal with these questions in today's entry. I won't always do this, but today I hope to influence the types of questions still to be submitted. Questions I can not answer The first question is deceptively simple: How to calculate upper quartiles and lower quartiles using SQL query? I have dates 1, 4, 9, 16, 25, 33, 36, 49, 64, 81. My first reaction is "Why?" There are better tools out there for doing statistical analysis than SQL. Quartiles can be done in SQL, it's just that I personally can't do them off the top of my head. (I'd need to spend a day or so writing and testing queries until I got it right. If I got it right.)Meanwhile, a quick Google search for "quartile SQL" turns up Calculating quartiles, by sheer coincidence in this very same Expert Answer Center. Maybe I cannot answer your question, but maybe someone else already did. Please take a few moments to check at least the first page or two of search results in your favorite search engine. Questions I will not answer The next question is a beauty: I'm trying to find how to write a SQL query without creating a store procedure for the following assignment. Given a table of employees containing the following data for all employees in the company:- Name (Employees' name)
- ID (Employee's ID)
- Salary (Employee's salary)
- Manager_ID (The ID of the employee's manager)
- Department (Employee's department)
How can I write a SQL query to return the number of employees in each department who make more money than their immediate manager? My answer to this question is: "Sorry, but I don't do homework assignments." At least, not in public. Contact me personally and you might be able to hire me. My rates for homework assignments are very, very high.Questions I can answer but so can you The next one is really too small, and I didn't want to post it as a separate answer in the Expert Answer Center: Does mySQLl 4.0.23 support triggers? Please tell me which version, if any. The answer is "No, you need 5.0.2." I didn't know the exact version number, but I found it within about 12 seconds on the MySQL site. By the way, the current production release is 4.1. Be careful developing an application using a pre-production release like 5.Questions I can maybe answer, but don't want to Finally, this one: I have the following SQL in my program. When I do the FETCH, the index is not being used (indexes are SAP_ORDER_NBR and TIMESTAMP).EXEC SQL DECLARE ORMSGSV1_CURSOR CURSOR FOR SELECT SAP_ORDER_NBR ,TIMESTAMP ,EVENT_ID ,TRANS_TYPE ,REF_NBR ,MESSAGE_AREA ,STATUS_CD ,PGM_NAME ,SAP_ITEM_NBR FROM RFUMNUD0.ORMSGSV1 WHERE SAP_ORDER_NBR = :DCLOHSHMTV1.SAP-ORDER-NBR AND SAP_ORDER_NBR > ' ' AND SAP_ORDER_NBR BETWEEN '0000000000' AND '9999999999' AND TIMESTAMP > '0001-01-01-00.00.00.000001' OPTIMIZE FOR 10 ROWS FOR READ ONLY END-EXEC. Can you give me any help on why the FETCH is doing a tablespace scan and not using the indexes? It appears that you are feeding the query a specific parameter value in :DCLOHSHMTV1.SAP-ORDER-NBR. So why do you also have to check to make sure that this value is greater than a string of blanks and, in addition, between "0000000000" and "9999999999"? Are some of your order numbers actually blank or not numeric? Similarly with the timestamp, are there really rows in your table with a timestamp that is less than or equal to "0001-01-01-00.00.00.000001"? These types of checks should be done in the application, if they are needed at all (they should not be, since you should feed in only valid parameter values).I'm guessing that if you took these conditions out of the SQL, maybe the index might be used. But I'm not sure, and I really don't want to give you false hope. I would have to come over to your office to see your actual data to understand why the query got written that way in the first place. Even then, I couldn't promise that removing these conditions would solve the performance problem. Performance usually depends more on non-SQL factors than the actual SQL, although this question does make a strong case for being careful that the SQL doesn't actually hinder performance.
Posted by Rudy Limeback
Hello and welcome!
18 JUL 2005 17:37 EDT (21:37, GMT)
Hello, and welcome to my SQL blog. First off, thanks to the folks here at the Expert Answer Center for giving me the opportunity to write about SQL on a daily basis. I hope you will enjoy the next two weeks. What topics will this SQL blog cover? SQL. That's it. More specifically: how to approach problems that you need to solve with SQL and how to "think in SQL." My preferred method of explaining SQL is through examples. It's a lot easier for me when I can refer to such-and-such table in the JOIN or this column in the GROUP BY clause and that other column in the SELECT list. Raw syntax is fine, but hey, that's what the manual is for. So I would like, as much as possible, to focus my writing efforts over the next two weeks on your questions. Three of you have already kindly submitted a question. I have answered two of them straight off (All about data purging and How can I remove $ signs in a temporary table?), and I'll do a more detailed analysis of the third in today's entry. Selecting a range of dates Here is one of the initial SQL questions submitted: I wrote a query that executes sales orders from the first of the current month to the current date and this query only runs on Sundays of every week.table_name ( trx_date date ) Assume the month is AUG-05 and SYSDATE is 28-AUG-05Select trx_date From table_name Where trx_date between TO_CHAR (TRUNC (SYSDATE,'MM')) AND TO_CHAR(SYSDATE) Output 01-AUG-05 02-AUG-05 05-AUG-05 11-AUG-05 12-AUG-05 13-AUG-05 28-AUG-05 Now if trx_date is on 29,30 and 31 of AUG-05, the sales orders wont appear from the query. How can I include these dates (29, 30 and 31) and display them on the 28-AUG-05?Something like: 01-AUG-05 02-AUG-05 05-AUG-05 11-AUG-05 12-AUG-05 13-AUG-05 28-AUG-05 29-AUG-05 30-AUG-05 31-AUG-05 Okay, let's take it from the top. First of all, what does "execute sales orders" mean? A query just extracts data; it does not change or update anything. So, if something is done with the sales order information that's extracted by this query every Sunday, it isn't done by this query. I'm going to assume that this query is merely some type of sales report. Which is fine, since SQL is excellent for this task. If you want to select, filter, extract, reformat, summarize and/or report data, it is almost always faster and will almost certainly be more efficient to do so using SQL than if you wrote application code to do the same task.Okay, let's move on to the actual query. All the action is in the WHERE clause. Those of you who know Oracle will immediately recognize the TO_CHAR and TRUNC and SYSDATE functions. Those of you who don't know Oracle might at this point be wondering "What the ...?" I do know Oracle, but not terribly well, and I'm still trying to figure out whether that BETWEEN condition would actually work or not. Apparently what's intended is that the WHERE clause is supposed to restrict the rows returned by the query to only those where the trx_date is within the same month as the current date. Now, in other database systems, you have functions like YEAR and MONTH, and you would use them like this: where year(trx_date) = year(current date) and month(trx_date) = month(current date) Note that it's important to be in the same year as well as the same month. The submitted query tests only the character month. So that would need to be fixed.Furthermore, off the top of my head I can't tell you without looking it up in the Oracle manual (and I don't really want to) whether TRUNC(..,'MM') generates 1 or 01 for January. Because if it generates 1 instead of 01, you are going to get wrong results for October, November and December, which all have a 1 as the first character, assuming that TO_CHAR(SYSDATE) generates a string that always has the month in the leftmost position, which is what the submitted query assumes. But is that safe? Isn't there an Oracle system setting where you can change the default format of date strings? But the main point is yet to come. The "best practices" approach to this sort of query is never to apply a function to a date column in a WHERE clause condition, but rather, compare the column value as a date value to constants. The reason has to do with performance. In this example, what you're looking for is something like this: where trx_date between '2005-08-01' and '2005-08-31' With this SQL construction, the optimizer can utilize an index on trx_date, if one exists. If you perform a function on the column value, the optimizer might do a table scan instead.But does this mean that you have to hard code the dates into the query every month? Most definitely not. What you do is write expressions involving the current date. In Oracle it's SYSDATE, and other databases have similar functions. You'll need an expression that results in the date of the first day of the current month and another for the date of the last day of the current month. So let's make that today's assignment. In addition to submitting your questions about SQL, feel free to submit your solution to today's WHERE clause. As for the original question we've discussed today, how to include on Sunday the 28th the orders for the 29th through 31st, I honestly don't know what this means. How can you have future orders? I mean, it isn't a ship date we've been talking about, it's a transaction date! And assuming it's okay to include these future orders on the 28th, why couldn't you also do it on the 21st if there is an order sitting there ready to go for the 22nd? If the person who originally posted this question would like to offer additional clarification, I'd be happy to take my analysis of the query further later this week. And if any of you have any comments about today's entry, please send those in as well.
Posted by Rudy Limeback
|