|
Why Generate Random Values
Databases are supposed to be for maintaining factual data — with
purity, integrity and reliability. So why are we interested in
generating random data? The answer to this is that data is a
representation of real-world facts and figures. While building a
system, we try to mimic the real-life data when testing proper
indexing, materialized view creation, partitioning, optimization
paths, access methodologies, and so on. The trick is to generate
such a data set that is random enough to represent the real
world, yet follows the same statistical pattern as the target
database. Note the key words here — it must be random, but needs
to follow the same statistical pattern as real life.
So how about generating some values such as the following:
-32nr
-32nr3121ne –e21e
323-=11r- r
0-vmdw-dwv0-[o- rr0-32r2 0
r4i32r –rm32r3p=x ewifef-432fr32o3-==
The values in this previous code example are all generated by
randomly pecking the keys of the keyboard. Random enough, right?
Not quite. Consider the following: What if you are generating a
data set to test out a new application by populating bank
accounts. In the U.S., account last names vary widely, but first
names are usually from a limited set — John, Jane, Michael, and
so on. While generating a sample, you may want to make sure that
the first names are populated from actual names such as these
only, not from some random, arbitrary characters as shown in the
previous example. They must be randomly distributed throughout
the table in such a way that the optimizer will follow the same
path as in a real-life system. Further, you may want it to
follow some more statistical patterns (e.g., 25 percent of the
first names should be John, eight percent should be Michael, and
so on). Creating a data set that closely mimics real life is
important in building a test bench to test application and
database development.
On the surface this seems like a trivial and/or fruitless
exercise; but in reality, it is one of the most important
activities of system design. A few examples will perhaps explain
why.
Building a Perfect System
Say an application is going to be developed at Acme Bank. The
team is discussing the nuts and bolts of the project —
everything from tablespace and file-layout to indexing and
partitioning schemes. Their biggest problem is that the database
they are designing will contain projected data only, and no
meaningful data exists with which to test their plan.
Jill, the lead developer, is mulling over the type of indexing —
should she use regular tables with b-tree indexes or Index
Organized Tables? How would she know without generating some
type of data?
Next, Deborah is wondering whether partitioning will help or
hurt the performance. It would make management easier, sure; but
query performance? To test her ideas, she needs a table
prepopulated with sample data.
Cathy, the system architect, is proposing a few materialized
views. But will this performance enhancement justify the stale
data?
At the same time, John, the DBA, is thinking about the best
partitioning strategy and is conflicted about whether to use
range or to use hash.
Finally, Jack, another DBA, is thinking about placing the
datafiles for certain tablespaces on fast disks; but which ones?
These are not hypothetical problems, but real-life challenges
faced by a team designing a system for optimal performance and
value. There has never been a lack of ideas, but a common
problem all these people face is the lack of a dataset that
represents the database they are going to build. They could
build such a dataset from scratch, but they don’t have the means
to enter each and every piece of data by hand.
Solution?
The answer to this problem is to generate random values, but not
arbitrary ones. But the values must be distributed following
some statistical pattern to reflect real life; a mere update
will not help in this case. For instance, suppose your
requirement calls for distributing first names as follows:
-
John — 30 percent
-
Abby — 10 percent
-
Scott — 5 percent
-
Jill — 5 percent
-
Mike — 50 percent
There are 1,000,000 rows in the test table, so you could specify
the following to achieve this pattern:
To update 350,000 or 35 percent of 1 million rows:
update
accounts set first_name = ‘John’
where rownum < 350001;
To update the next 5 percent of 1 million rows, or 50,000
records, to Abby:
update
accounts set first_name = ‘Abby’
where first_name != ‘John’
and rownum < 50000
... and so on. This procedure will satisfy the overall
distribution of the values, but the values will be distributed
as follows: The first 35 percent of the records will be for the
first-name John, which just won’t happen in real life. In a real
table, the first name John will be distributed throughout the
table, not in the first 35 percent of the rows. This imperfect
distribution would impact heavily on operations such as index
scans. Since the optimizer decides between full-table scan and
index scan based on on how the data is distributed, its choice
would be greatly affected by the unusual, lopsided availability
of values at the head end of the table.
Similarly, these values would probably be concentrated in a
small portion of the disk or disk groups, and this would skew
the results of the I/O test. So, such an unrealistic
distribution of values of the first name John is neither helpful
nor desirable.
Other Uses
Values must also be randomly generated when you are creating Web
site user IDs for registered users, or when you are generating
temporary passwords. Although these are jumbled characters and
numbers, the data must fit a pattern (e.g., passwords must be
between eight and 12 characters and should have at least four
numbers and four characters, and so on).
Although it’s not important anymore in Oracle 10g, the
encryption feature requires random keys to be generated. And the
keys must be random enough to be difficult to guess.
Building a Random Character Value Generator
Oracle Database 10g provides a random value generator
via a package named
DBMS_RANDOM, which can be used to
generate random values. However, those values are arbitrary, not
representative of the real-life system. For instance, the
following an example of how you can generate a string 30
characters long:
SQL> select
dbms_random.string('P',30) from dual;
DBMS_RANDOM.STRING('P',30)
-----------------------------------------------------
O=*KXh}#O`D~1|'^VheWr:jZ>!:P}3
The package DBMS_RANDOM
has several functions and procedures, one of which is
STRING(), which
accepts two parameters and returns a character value. The types
of characters generated are determined by the first parameter —
OPTION. Here are
the options and the types of string generated:
|
Option |
Type of String Produced |
|
U |
Any uppercase alphabetic characters |
|
L |
Any lowercase alphabetic characters |
|
A |
Any mixed case alphabetic characters |
|
X |
Any alphanumeric character in upper case |
|
P |
Any printable character |
You can also use lowercase characters for the options; e.g., x
instead of X. In this example, we used X, which produced a
string of printable characters, not just alphabetic ones.
Clearly, this kind of character string will not represent a
customer name. Using an option such as “A” would generate a more
appropriate string.
Populating a Table
Let’s start with a real-life example. In the case of Acme Bank’s
application development, the team is mulling over the proper
indexing, partitioning, physical layout, materialized view
planning, and so on, for the ACCOUNTS table, which holds the
customers’ account data. Here is how the table looks :
SQL> desc accounts
Name Null? Type
----------------- -------- ------------
ACC_NO NOT NULL NUMBER
FIRST_NAME NOT NULL VARCHAR2(30)
LAST_NAME NOT NULL VARCHAR2(30)
ACC_TYPE NOT NULL VARCHAR2(1)
FOLIO_ID NUMBER
SUB_ACC_TYPE VARCHAR2(30)
ACC_OPEN_DT NOT NULL DATE
ACC_MOD_DT DATE
ACC_MGR_ID NUMBER
The columns are fairly self-explanatory. To represent customers
in the real world, the following requirements have been
specified for the columns:
|
Column Name |
Purpose |
Data Pattern |
|
ACC_NO |
Account Number |
Any number less than 10 digits |
|
FIRST_NAME |
The first name |
-
10% Alan
-
10% Barbara
-
5% Charles
-
5% David
-
15% Ellen
-
20% Frank
-
10% George
-
5% Hillary
-
10% Iris
-
10% Josh
|
|
LAST_NAME |
The last name |
Any alphabetic character between four and 30 but 25
percent should be “Smith” |
|
ACC_TYPE |
The type of account — Savings, Checking, and so on |
20 percent each of S, C, M, D, and X |
|
FOLIO_ID |
The folio ID from the other systems |
Half NULL and the rest half a number related to the
account number |
|
SUB_ACC_TYPE |
If the customer is incorporated, then sub-account types,
if any 75 percent null |
From the values populated:
|
|
ACC_OPEN_DT |
Date account was opened |
A date between now and 500 days ago |
|
ACC_MGR_ID |
The ID of the account manager servicing the account
|
There are five account managers, with account
percentages distributed as follows:
-
1 — 40 percent
-
2 — 10 percent
-
3 — 10 percent
-
4 —10 percent
-
5 — 30 percent
|
As you can see, some fairly complex requirements were specified,
but for good reasons. These accurately reflect how the data will
be distributed in real life. In real life, there will be
customers with first names like “Josh” and “Ellen,” not “XepqjEuF”;
so, the names must be chosen from the set of possible names.
And, in the U.S., people are called by a variety of last names.
Thus, we want a semi-random distribution with 25 percent of a
very popular last name, “Smith.”
Generating Random Numbers
Before we go any further, we have to discuss how to generate
random numbers. The package
dbms_random
contains a function to return random numbers. If you want to
return any random number between -2^31 and +2^31, use the
function RANDOM. The following shows how we have generated
numbers 10 times:
SQL> begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='||dbms_random.random);
4 end loop;
5 end;
6 /
Random Number=-81420432
Random Number=-1024262734
Random Number=-1965250926
Random Number=1439118604
Random Number=675429938
Random Number=-466227661
Random Number=613708106
Random Number=-1154140330
Random Number=-643127572
Random Number=-133140229
This function is available but deprecated. Oracle recommends
using a new function called VALUE().
SQL> l
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='||dbms_random.value);
4 end loop;
5* end;
SQL> /
Random Number=.547109841457281046373693994862305187
Random Number=.00500720861223234783817674992068380455
Random Number=.86974675431616311272549579082479240362
Random Number=.16306119784529083761710557238498944243
Random Number=.39327329046753189206427695323437381763
Random Number=.68307381430584611139249432690613072007
Random Number=.24640768414299435941101562183729221882
Random Number=.50568180275705934132098716939122047439
Random Number=.73345513921455391594299189253661168712
Random Number=.71764049051903979360796313613342367114
Note the difference; VALUE returns a positive random decimal
number less than one, with 38 digits after the decimal point. In
most cases, this is probably enough, with some modification (if
you are looking for a 10-digit whole number, for instance,
simply multiply it by 10,000,000,000 and use FLOOR() to discard
the decimals).
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='||
4 floor (10000000000 * dbms_random.value)
5 );
6 end loop;
7* end;
SQL> /
Random Number=939084911
Random Number=690705371
Random Number=4696700513
Random Number=7978266084
Random Number=5157885833
Random Number=8902042948
Random Number=5839885968
Random Number=6207324613
Random Number=5633096626
Random Number=1891871746
The function VALUE() is overloaded; another variation of the
functions accepts a set of low and high values and returns a
number between them. To generate a number between 10 and 20, for
instance, you will use
dbms_random.value (10, 20)
Note: The numbers generated will be more than
or equal to 10, but less than (never equal to) 20.
Using it to generate multiple values, we get
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='||
4 dbms_random.value (10,20)
5 );
6 end loop;
7* end;
SQL> /
Random Number=11.0908840421899990054735051046498510644
Random Number=14.5391525077977177619575126573478641174
Random Number=11.1343623028437386191996545034616142284
Random Number=14.8831724931888089801812805693409810454
Random Number=11.7294096895635532492353976105129628219
Random Number=15.0589018451126293465505440533299716691
Random Number=15.6152418028444556585301007183814067306
Random Number=18.2903839241741813303152060555229669855
Random Number=13.0914352695913402037626709927051618367
Random Number=12.6027933168087755621076348461280998683
As you can see, the numbers are generated with 37 digits after
the decimal point. This is a very valuable function and will
suffice for lmost types of random number generation. If you want
to generate only whole numbers, use the same approach used
earlier, using the FLOOR() function.
floor(dbms_random.value (10,20))
Generating Specific Characters and Strings
So, how can we generate specific strings that follow a
predetermined statistical pattern?
We can borrow a page from the probability theory text to
accomplish this. The trick is to use a method similar to
Monte-Carlo simulation used by statisticians worldwide. In this
approach, we generate a random number, between one and 100 (both
inclusive). Over a period of time, the probability that a
specific number, say six, will turn up is exactly one time out
of 100, or 1 percent of the time. In fact, all the numbers have
1/100 probability. Going by the same approach, the probability
that either of two numbers — say, 1 and 2 — will be 2 percent.
And, of course, the probability that any one of numbers between
one and 10 will turn up is 10 percent. We will use this to
configure the probability of the random value.
Take, for instance, the value of the column
ACC_TYPE, which
calls for equal probability of S, C, M, D and X; or 20 percent
probability each. If we generate a whole number between one and
five (both inclusive), the probability of each number will be 20
percent. Then we can use a
DECODE() function
to get the ACC_TYPE
value.
SQL> select
2 decode (
3 floor (
4 dbms_random.value (1,6)
5 ),
6 1,'S',
7 2,'C',
8 3,'M',
9 4,'D',
10 'X'
11 )
12 from dual;
First, we are generating a number between one and five (line 4).
Since the number is generated is less than the highest value
passed as a parameter, we have specified six. And since we want
a whole number, we have used the
FLOOR() function
in line 3. It truncates all decimal values from the generated
number. Depending on the number obtained, we used
DECODE() to get
one of the values — S, C, M, D, or X. Since the numbers 1, 2, 3,
4, and 5 will have equal probability of being generated, so will
be the letters — at 20 percent each.
This technique is very valuable for generating random, but
useful values, as shown previously. The same approach can be
used to generate almost all types of pre-determined random
values.
Random Values with NULLs
Remember, the requirement for
FOLIO_ID is a
little different. It needs only 50 percent of the values
populated; the rest should be
NULL. How can we
achieve this?
Quite simply, we will use the same probability approach with a
twist: we will use a determination of yes or no. Generating a
random number between one and 100 will ensure 1 percent
probability of each number. Hence, a number less than 51 will
have exactly 50 percent probability of occurring. We can use
this in a CASE statement to get the value.
SQL> select
2 case
3 when dbms_random.value (1,100) < 51 then null
4 else
5 floor(dbms_random.value(1,100))
6 end
7 from dual;
On line 3, we should check to see whether the number generated
is less than 51. If so, we return
NULL. Since the
probability of a sub-51 number is 50 percent , we have NULLs
occurring 50 percent of the time as well. In the other 50
percent of the time, we have generated a value to be used as a
FOLIO_ID.
Random Strings of Random Length
In dbms_random.string,
a random string is generated, but of fixed length. That is not
representative of real life; in reality, people have last names
of varying lengths. In this example, the requirement is to have
a length between four and 30 characters. To facilitate this, we
can pass the length as a random number as well to the function
dbms_random.string
in line 6 below.
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random String='||
4 dbms_random.string (
5 'A',
6 dbms_random.value(4,30)
7 )
8 );
9 end loop;
10* end;
SQL> /
Random String=RniQZGquFVJYFpGLOvtNd
Random String=GhcphpcsaCXlhigRQY
Random String=JtakoelUf
Random String=BgCOu
Random String=QFBzQxcHqGlHWkZFmnN
Random String=lSxVjqJvpwBB
Random String=jfhNARzALrLOKZRpOwnhrzz
Random String=KuFtdJcqQpjkrFmzFbzcXnYFGjWo
Random String=BhuZ
Random String=GebcqcgvzBfEpTYnJPmYAQdb
Notice that the strings are of different lengths. Remember, 25
percent of the last names must be “Smith,” and the rest must
have random lengths. We can accomplish this by combining the
random strings and the Monte-Carlo approach:
decode (
floor(dbms_random.value(1,5)),
1,'Smith',
dbms_random.string ('A',dbms_random.value(4,30))
)
The previous expression will return “Smith” 25 percent of the
time and a random alphabetic string between four and 30
characters long the rest of the time.
Putting it All Together
Now that you understood the building blocks of the randomization
approach, you can put them together to build the account record
generation PL/SQL block as shown below. In the following
example, we are loading 100,000 records into the table
ACCOUNTS. Here is
the loading program in full:
begin
for l_acc_no in 1..100000 loop
insert into accounts
values
(
l_acc_no,
-- First Name
decode (
floor(dbms_random.value (1,21)),
1, 'Alan',
2, 'Alan',
3, 'Barbara',
4, 'Barbara',
5, 'Charles',
6, 'David',
7, 'Ellen',
8, 'Ellen',
9, 'Ellen',
10, 'Frank',
11, 'Frank',
12, 'Frank',
13, 'George',
14, 'George',
15, 'George',
16, 'Hillary',
17, 'Iris',
18, 'Iris',
19,'Josh',
20,'Josh',
'XXX'
),
-- Last Name
decode (
floor(dbms_random.value(1,5)),
1,'Smith',
dbms_random.string ('A',dbms_random.value(4,30))
),
-- Account Type
decode (
floor(dbms_random.value (1,5)),
1,'S',2,'C',3,'M',4,'D','X'
),
-- Folio ID
case
when dbms_random.value (1,100) < 51 then null
else
l_acc_no + floor(dbms_random.value(1,100))
end,
-- Sub Acc Type
case
when dbms_random.value (1,100) < 76 then null
else
decode (floor(dbms_random.value (1,6)),
1,'S',2,'C',3,'C',4,'C',5,'C',null)
end,
-- Acc Opening Date
sysdate - dbms_random.value(1,500),
-- Account Manager ID
decode (
floor(dbms_random.value (1,11)),
1,1,2,1,3,1,4,1,5,2,6,3,7,4,8,5,9,5,10,5,0
)
);
end loop;
commit;
end;
Now, how do we know that all these exercise yielded fruit? After
this table is loaded, let’s see the actual distribution:
SQL> select
first_name, count(*) from accounts group by first_name;
FIRST_NAME COUNT(*)
------------------------------ ----------
Alan 9834
Barbara 10224
Charles 5046
David 4980
Ellen 15094
Frank 14960
George 14890
Hillary 4898
Iris 10009
Josh 10065
Great! The distribution for each first name is exactly we
wanted. For instance, we wanted to have 10 percent rows with
first name “Alan,” and we got 9,834 out or 100,000; this equates
to approximately 10 percent . We wanted 15 percent with the name
“Ellen” and we got 15.094% — pretty close to that number and
statistically significant. Similarly, you can go through all the
other columns and see how they are actually distributed.
Conclusion
Generating random values in PL/SQL is a much-sought-after, but
often less-understood aspect of the PL/SQL language. As we saw
in the previously shown case, generating random numbers or
strings is not what is most often needed. The real need is to
generate data representing the real world, which requires a
different approach, using the same built-in functions. In this
article, you learned how to generate random values following a
predetermined statistical approach, and then how to use it to
populate a system mimicking real life.
Further Reading
An updated copy of this article may be found on
www.proligence.com/downloads.html.
--
Arup Nanda
is the Lead DBA at Starwood Hotels & Resorts. He has been an
Oracle DBA for more than 11 years, touching all aspects of
database management — from modeling to performance tuning to
disaster recovery. He is the co-author of the book
Oracle Privacy Security Auditing (2003, Rampant Tech
Press). |