SQL – Recursive Summing of Related Entities

5 12 2009

December 5, 2009

Table dir_size stores the mbytes of storage used in a given directory. Table directories stores various directory names which may or may not exist in table dir_size.
For every directory in table directories, report the cumulative storage in that directory and all its subdirectories. This solution uses a cartesian join.  I imagine it will not scale well.

create table dir_size (
dir_name     varchar2(40),
mbytes       number
);

create table directories (
dir_name    varchar2(40)
);

insert into dir_size values ('c:\aaa\bbb\ccc\ddd', 100);
insert into dir_size values ('c:\aaa\bbb\ccc', 100);
insert into dir_size values ('c:\aaa\bbb', 100);
insert into dir_size values ('c:\aaa', 100);
insert into dir_size values ('c:\', 100);
insert into directories values ('c:\aaa\bbb\ccc\ddd');
insert into directories values ('c:\aaa\bbb\ccc');
insert into directories values ('c:\aaa\bbb');
insert into directories values ('c:\aaa');
insert into directories values ('c:\');
insert into directories values ('c:\xxx\yyy\zzz');
commit;

select dir_name, sum(mbytes) from (
select directories.dir_name,
instr(dir_size.dir_name, directories.dir_name) INSTR,
mbytes
from directories, dir_size
)
where INSTR = 1
group by dir_name
order by 1;

DIR_NAME                                 SUM(MBYTES)
---------------------------------------- -----------
c:\                                              500
c:\aaa                                           400
c:\aaa\bbb                                       300
c:\aaa\bbb\ccc                                   200
c:\aaa\bbb\ccc\ddd                               100

This appears to be a hard problem.  To avoid headaches, make certain that each of the DIR_NAMES ends with “\”

Let’s start here:

SELECT
'c:\aaa\bbb\ccc\ddd\' DIR_NAME,
100 MBYTES
FROM
DUAL;

DIR_NAME                 MBYTES
-------------------- ----------
c:\aaa\bbb\ccc\ddd\         100

In your example, you would like to put 100MB into the following directories based on the above:

c:\
c:\aaa\
c:\aaa\bbb\
c:\aaa\bbb\ccc\
c:\aaa\bbb\ccc\ddd\

You somehow need to be able to break that one row into 5 rows.  The following might help

SELECT
LEVEL L
FROM
DUAL
CONNECT BY
LEVEL<=20;

L
---
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

If we join those two row sources together we might be able to create 5 rows from the one row:

SELECT
SUBSTR(DIR_NAME,1,INSTR(DIR_NAME,'\',1,L)) DIR_NAME2,
MBYTES
FROM
(SELECT
'c:\aaa\bbb\ccc\ddd\' DIR_NAME,
100 MBYTES
FROM
DUAL) DIR_SIZE,
(SELECT
LEVEL L
FROM
DUAL
CONNECT BY
LEVEL<=20) C
WHERE
SUBSTR(DIR_NAME,1,INSTR(DIR_NAME,'\',1,L)) IS NOT NULL;

DIR_NAME2                MBYTES
-------------------- ----------
c:\                         100
c:\aaa\                     100
c:\aaa\bbb\                 100
c:\aaa\bbb\ccc\             100
c:\aaa\bbb\ccc\ddd\         100

Now, if we performed the same process for all of the rows in the DIR_SIZE table, grouping on DIR_NAME2, we might be able to find the SUM of the MBYTES column.

(Note that I did not provide an exact/final answer to the original poster – my post was intended to push the OP in the right direction of a solution.)

The OP followed up with this comment:

Thanks for the suggestion.  I suspect the best way will involve some kind of recursive processing.  The tricky bit is the matching of the rows in the directories table to the rows in the dir_size table.  We need to do a “like” (which we can’t, of course) which is why I thought of the instr.

The LIKE keyword is not necessary.

Notice how closely the output of the following SQL statement:

SELECT
'c:\aaa\bbb\ccc\ddd\' DIR_NAME,
100 MBYTES
FROM
DUAL;

Matches the row created by one of your insert statements:

insert into dir_size values ('c:\aaa\bbb\ccc\ddd', 100);

You might try replacing in the above examples:

SELECT
'c:\aaa\bbb\ccc\ddd\' DIR_NAME,
100 MBYTES
FROM
DUAL;

With a SQL statement that selects all of the rows from your DIR_SIZE table – the results might surprise you IF each of the DIR_NAME values end with a “\”.
You really need more variety in the insert statements to see what is happening, for example:

insert into dir_size values ('c:\ddd\', 800);
insert into dir_size values ('c:\ddd\kkk\', 300);

The first of the above SQL statements will increase the calculated SUM in the c:\ directory by 800, and the second insert statement will increase the SUM in both of the c:\ and c:\ddd\ directories by 300 if you modify my original example to use the DIR_SIZE table rather than the DUAL table.
The final part that I did not provide to the OP is below:

TRUNCATE TABLE DIR_SIZE;

insert into dir_size values ('c:\aaa\bbb\ccc\ddd\', 100);
insert into dir_size values ('c:\aaa\bbb\ccc\', 100);
insert into dir_size values ('c:\aaa\bbb\', 100);
insert into dir_size values ('c:\aaa\', 100);
insert into dir_size values ('c:\', 100);
insert into dir_size values ('c:\ddd\', 800);
insert into dir_size values ('c:\ddd\kkk\', 300);

Working with the hints provided and the final SQL statement in my post, we start with the following:

SELECT
SUBSTR(DIR_NAME,1,INSTR(DIR_NAME,'\',1,L)) DIR_NAME2,
MBYTES
FROM
(SELECT
*
FROM
DIR_SIZE) DIR_SIZE,
(SELECT
LEVEL L
FROM
DUAL
CONNECT BY
LEVEL<=20) C
WHERE
SUBSTR(DIR_NAME,1,INSTR(DIR_NAME,'\',1,L)) IS NOT NULL;

DIR_NAME2                MBYTES
-------------------- ----------
c:\                         100
c:\                         100
c:\                         100
c:\                         100
c:\                         100
c:\                         800
c:\                         300
c:\aaa\                     100
c:\aaa\                     100
c:\aaa\                     100
c:\aaa\                     100
c:\ddd\                     800
c:\ddd\                     300
c:\aaa\bbb\                 100
c:\aaa\bbb\                 100
c:\aaa\bbb\                 100
c:\ddd\kkk\                 300
c:\aaa\bbb\ccc\             100
c:\aaa\bbb\ccc\             100
c:\aaa\bbb\ccc\ddd\         100

SELECT
SUBSTR(DIR_NAME,1,INSTR(DIR_NAME,'\',1,L)) DIR_NAME2,
SUM(MBYTES) MBYTES
FROM
DIR_SIZE,
(SELECT
LEVEL L
FROM
DUAL
CONNECT BY
LEVEL<=20) C
WHERE
SUBSTR(DIR_NAME,1,INSTR(DIR_NAME,'\',1,L)) IS NOT NULL
GROUP BY
SUBSTR(DIR_NAME,1,INSTR(DIR_NAME,'\',1,L))
ORDER BY
SUBSTR(DIR_NAME,1,INSTR(DIR_NAME,'\',1,L));

DIR_NAME2                MBYTES
-------------------- ----------
c:\                        1600
c:\aaa\                     400
c:\aaa\bbb\                 300
c:\aaa\bbb\ccc\             200
c:\aaa\bbb\ccc\ddd\         100
c:\ddd\                    1100
c:\ddd\kkk\                 300