Managing XML Data in Oracle XML DB 11g
by V.J. Jain -
From Oracle Technical Network
Over the past few
years, XML has emerged as the new standard for data transmission,
and its use is becoming more prevalent as companies continue to
adopt XML-based solutions. As more organizations begin to enforce
XML standards for all data transmission, increasingly complex XML
formats are emerging. These complex formats can include multiple
namespaces, thousands of elements, and recursive definitions. As the
XML documents produced from these formats grow in size and
complexity, managing this content has become increasingly
challenging, with limited information available on how to address
this challenge.
In this article, you
will learn how to use the XML DB feature in Oracle Database 11g
to manage complex XML content as well as its advantages over
commercial ETL products. You will see an example of a complex XML
schema that demonstrates the following:
-
Registration of a
complex XML schema
-
Insertion of XML
files into the database
-
Retrieval of XML
data via relational query
-
In-Place Evolution
for XML schema modifications
Furthermore, you will
get an overview of strategies for maximizing the performance and
throughput of Oracle XML DB solutions and practical applications of
complex XML formats.
Oracle XML DB Background
Oracle XML DB is a
feature of Oracle Database that provides a powerful tool for
managing XML content, including storage, manipulation, and
retrieval. It offers different storage options to meet the unique
requirements of different XML formats. These options include
unstructured, binary, and structured storage:
-
Unstructured (character large object, or CLOB).
By treating the document as one large object and storing it in
the database, this method allows for the best insertion times.
However, this storage method also consumes the most space and
has the worst performance for relational access to the data.
This is an impractical solution for managing large, complex XML
documents if relational access is required. Unstructured storage
can be a practical solution if disk space is not an issue and
the objective is to archive the documents in their original
format.
-
Binary
storage. This
option, new in Oracle Database 11g, stores data in a
postparse binary format designed specifically for XML data. This
option has several advantages over unstructured storage, in that
it is XML-schema-aware, allowing better disk space efficiency
and query performance. Although this option offers incredible
performance compared to that of unstructured storage, it does
not have the same query performance as structured storage.
-
Structured
storage. Also
known as schema-based storage, this option uses an
object-relational model to store XML documents in the database.
This storage option is the most efficient in terms of disk space
and relational access. It also has the highest overhead during
file insertion and requires additional preparation for schema
registration.
The perception of the
size and complexity of an XML document can differ greatly, depending
on the organization. On one hand, for online transaction processing
(OLTP) databases using XML for their electronic data interchange
(EDI) or other transactional data exchange, a file with several
thousand lines might be regarded as a very large file. On the other
hand, a multiterabyte data warehouse might regularly process XML
documents measured in gigabytes and not consider a file to be large
unless it contains millions of lines. The same concept holds for the
perceived complexity of an XML document.
For purposes of this
article, a document is considered "complex" if it has the following
properties:
-
It is
single-rooted, with multiple namespaces.
-
It has flexible
XML definitions, allowing for great variations while maintaining
validity.
-
It has recursive
or circular/cyclical references.
-
It has nonstatic
XML schemas.
In this article, XML
documents are considered "large" if they are single-rooted and are
greater than 20MB. These properties introduce certain scalability
and management considerations that must be addressed for a robust
enterprise solution.
There are no golden
rules for choosing the best storage option. Based on the file
structure, performance objectives, available resources, and expected
volume of data, the best option will vary. If you cannot decide
which storage option is the best for your particular requirement, it
is worthwhile to try the different formats and determine which is
optimal for your specific needs.
Maximizing Throughput
with Structured Storage
Although there is an
overhead cost for inserting files when you are using the structured
storage option, you can reduce this cost by splitting large
documents into smaller pieces. For example, if there is a 700MB
single-rooted XML file, it may be possible to split it into 10
smaller files that are valid with the XML schema. Inserting 10
different 70MB files is much faster than inserting a single 700MB
file and achieves the same end result. The level of concurrency
should be determined by the available processing power of the
database. This strategy takes advantage of the database concurrency
and maximizes throughput.
Another consideration
when using Oracle XML DB is that the insertion time for a single XML
file is generally limited by a single CPU's speed. In other words,
having multiple processors does not help the throughput of a single
document. For example, consider a situation in which a complex,
single-rooted 700MB document needs to be inserted by use of
schema-based storage. The total time to insert this file might be 10
minutes with a 1.35GHz processor, whether the database has 12 or 72
CPUs (assuming that at least one CPU is available in each scenario).
Unfortunately,
calculating these comparisons is not an exact science. Each database
may perform differently, depending on several factors, including the
operating system, available processing power, memory, and schema
definition. The best way to optimize performance in your environment
is to implement these strategies and determine the benefits of each
of them.
Oracle XML DB Versus Commercial Off-the-Shelf
(COTS) ETL Products
Several commercial
extract, transform, and load (ETL) tools are available for loading
data from files into the database. These tools usually feature a
simple front end with drag-and-drop capability and can hide the
complexity of the actual process. When you are creating an XML
loading process in a commercial ETL tool, it is necessary to define
the fields that need to be extracted. If a certain XPath is not
specified, the data will not be collected from the document.
In the case of complex
XML documents, the advantage of Oracle XML DB is that it takes a
database-centric view of documents and shreds each document into an
object-relational model. When the file is successfully inserted into
the database, any data contained within that file is readily
accessible without reparsing.
One of the best
benefits of Oracle XML DB is that it is a standard feature of Oracle
Database and does not require additional licensing. But if licensing
fees were not a concern for an organization, why would it use Oracle
XML DB for its XML content management instead of a tool from a
company with ETL as its primary focus? To understand the answer, it
is necessary to understand how Oracle XML DB technology meets the
unique requirements of managing very complex XML.
The Challenges of Complex XML Content
When XML content is
flexible, frequently changing, recursive, and very large, a
developer will undoubtedly encounter certain challenges that might
be absent from simpler formats. One possibility with a complex XML
document is that its XML schema might be very large and allow a
great amount of flexibility while remaining valid. The use of
flexible XML schemas is a common strategy for supporting
industrywide standards while accommodating company-specific
requirements.
For example, consider
an XML schema that is adopted as a standard for three companies. In
addition to the shared or standard elements, this XML schema will
need to include all of the company-specific definitions for each of
the companies. To support this requirement, the XML schema is
designed with a great deal of flexibility by use of generic
container elements that reference all possible company-specific
elements and allow most element references to occur zero or more
times. As a result, documents with completely different elements are
valid with the same XML schema. From a development perspective, this
flexibility makes it difficult to write XML parsers to extract the
necessary data, because the occurrence of elements is difficult to
predict. Because COTS ETL tools and custom parsers require a
specific XPath for extracting data, every possible XPath would need
to be checked to guarantee the capture of all data from the
document.
This XML schema could
be used as an industry standard for transmitting sales and inventory
data. Note that in the standardData.xsd namespace, both the
childContainer element and the companySpecificContainer element are
optionally self-referencing. In this particular definition, this
design enables each individual company to decide the granularity of
its data using the parent/child relationship. This schema also gives
each company the option of including inventory data, sales data, or
both. It further enables each company to include zero or more
stores, products, and sales, based on its individual needs but
within the same flexible format.
These documents are
very different, but both are valid in the XML schema. This design
facilitates the use of industry wide standards across different
companies, by using a single, flexible format in the XML schema.
Furthermore, the use of recursive references for generic container
elements enables each company to decide how much detail to include
while maintaining compliance with the XML schema. For example, one
XPath for extracting quantity-on-hand data from CompanyABC.xml would
be
'/rootElement/fileData/xn:childContainer/xn:CompanySpecificContainer/
xn:CompanySpecificContainer/xn:attributes/cs:csProduct/cs:value'
However, if this company wanted to include store
data with a finer granularity, it could add another child element
for substores, such as
'/rootElement/fileData/xn:childContainer/xn:CompanySpecificContainer/xn:CompanySpecificContainer/
xn:CompanySpecificContainer/xn:attributes/cs:csProduct/cs:value'
The design of this XML
schema enables the files to be rich with content that is determined
by each individual company. By using the generic container elements
and optional references, each document can begin to resemble a
database of its own. Due to the unpredictability of each file,
managing this XML with custom code or a COTS ETL tool would be an
intense development effort that would likely require continuous
maintenance and support.
The Oracle XML DB Solution
This problem with
XPath mapping, as demonstrated in the above example, does not exist
with Oracle XML DB, because the entire file is stored in the
database. As soon as the document has been inserted, the content
from the document is immediately available for query. Regardless of
how flexible the XML schema might be, any data contained in the
document can be accessed with the appropriate XPath. This provides
an unmatched advantage that maximizes availability and minimizes the
cost of maintenance.
To implement the XML
DB solution, start by registering the XML schema (the definition
files and documents are in a directory called XML_TEST):
begin
DBMS_XMLSCHEMA.REGISTERSCHEMA(
schemaurl => 'CompanySpecific.1.0.xsd',
schemadoc => BFILENAME ('XML_TEST','CompanySpecific.1.0.xsd')
);
DBMS_XMLSCHEMA.REGISTERSCHEMA(
schemaurl => 'standardData.xsd',
schemadoc => BFILENAME ('XML_TEST','standardData.xsd')
);
DBMS_XMLSCHEMA.REGISTERSCHEMA(
schemaurl => 'startData.xsd',
schemadoc => BFILENAME ('XML_TEST','startData.xsd')
);
end;
/
Now that you have created the object-relational
structure for our XML schema, you are ready to insert XML files into
the default table (specified in the annotation of the root element).
insert into XML_DEFAULT values (XMLTYPE(BFILENAME
('XML_TEST','CompanyABC.xml'),nls_charset_id('AL32UTF8')));
/
The file is successfully inserted in a fraction of a
second. Immediately the data is available for relational access.
Here is an example of a query access that describes the current
inventory:
SELECT extractValue(object_value,'/rootElement/fileFooter/@timeStamp') start_date,
extractValue(object_value,'/rootElement/fileHeader/@companyName') companyName,
extractValue(object_value,'/rootElement/fileHeader/@fileFormat') fileFormat,
extractValue(value(b),'/xn:childContainer/xn:attributes/xn:childLabel',
'xmlns:xn="standardData.xsd"') childLabel,
extractValue(value(b),'/xn:childContainer/xn:attributes/xn:childType',
'xmlns:xn="standardData.xsd"') childType,
extractValue(value(c),'/xn:CompanySpecificContainer/xn:attributes/xn:csDataType',
'xmlns:xn="standardData.xsd"') csDataType,
extractValue(value(c),'/xn:CompanySpecificContainer/xn:attributes/xn:csDataFormat',
'xmlns:xn="standardData.xsd"') csDataFormat,
extractValue(value(c),'/xn:CompanySpecificContainer/xn:attributes/cs:csStore/cs:label',
'xmlns:xn="standardData.xsd" xmlns:cs="CompanySpecific.1.0.xsd"') csStoreLabel,
extractValue(value(c),'/xn:CompanySpecificContainer/xn:attributes/cs:csStore/cs:name',
'xmlns:xn="standardData.xsd" xmlns:cs="CompanySpecific.1.0.xsd"') csStoreName,
extractValue(value(c),'/xn:CompanySpecificContainer/xn:attributes/cs:csStore/cs:value',
'xmlns:xn="standardData.xsd" xmlns:cs="CompanySpecific.1.0.xsd"') csStoreValue,
extractValue(value(c),'/xn:CompanySpecificContainer/xn:CompanySpecificContainer/xn:attributes/xn:csDataType',
'xmlns:xn="standardData.xsd"') csDataTypeL2,
extractValue(value(d),'/xn:CompanySpecificContainer/xn:attributes/xn:csDataFormat',
'xmlns:xn="standardData.xsd"') csDataFormatL2,
extractValue(value(d),'/xn:CompanySpecificContainer/xn:attributes/cs:csProduct/cs:label',
'xmlns:xn="standardData.xsd" xmlns:cs="CompanySpecific.1.0.xsd"') csProductLabel,
extractValue(value(d),'/xn:CompanySpecificContainer/xn:attributes/cs:csProduct/cs:name',
'xmlns:xn="standardData.xsd" xmlns:cs="CompanySpecific.1.0.xsd"') csProductName,
extractValue(value(d),'/xn:CompanySpecificContainer/xn:attributes/cs:csProduct/cs:value',
'xmlns:xn="standardData.xsd" xmlns:cs="CompanySpecific.1.0.xsd"') csProductValue
from
XML_DEFAULT a
,TABLE(XMLSequence(Extract(object_value, '/rootElement/fileData/xn:childContainer',
'xmlns:xn="standardData.xsd" xmlns="startData.xsd"'))) b
,TABLE(XMLSequence(Extract(value(b), '/xn:childContainer/xn:CompanySpecificContainer',
'xmlns:xn="standardData.xsd"'))) c
,TABLE(XMLSequence(Extract(value(c), '/xn:CompanySpecificContainer/xn:CompanySpecificContainer',
'xmlns:xn="standardData.xsd"'))) d
/
This solution allows
for the management of any documents that conform to the XML schema
definition. Oracle XML DB's database-centric view of XML documents
enables an XML document to be content-rich while providing a layer
of abstraction from the complexities of development that would be
unavoidable with any other ETL technology.
This example
demonstrates the unmatched XML storage-and-retrieval capabilities of
Oracle XML DB, but a complex XML schema is also characterized as
having frequent changes to the XML schema definition. How does
Oracle XML DB facilitate changes to the XML schema?
Managing XML Schema Changes
Changes to the schema
definitions have always been problematic for XML-based solutions.
This is expected, because an XML schema is supposed to be used to
validate an XML document. If the document contains elements that are
not defined in the schema, the document is not valid. In the past,
this has been a weakness of Oracle XML DB, because changes to
registered XML schemas required a complex and costly operation using
the copyEvolve procedure.
To address this
limitation, Oracle has introduced in Oracle XML DB 11g a
new procedure called inPlaceEvolve, which allows the same schema
modifications to be an online operation that does not require data
movement. Instead, this procedure modifies the database objects
created during the schema registration while leaving the related
data in place. This enhancement is crucial in addressing the
frequent definition changes common with complex XML schemas.
With Oracle XML DB 11g,
this modification is fast and requires minimal resources. The new
procedure requires the schema URL and an XMLType document (XMLDiff)
that conforms to the xdiff XML schema. The XMLDiff document is a
specially formatted document that reflects the changes in the XML
schema. Instead of your having to manually create the XMLDiff
document, it can be done automatically with the xmldiff function in
Oracle Database.
First, create a new
schema file with the updated XML schema definition
CompanySpecific.1.1.xsd. Then use the Oracle Database xmldiff
function with the old schema as the first parameter and the new
schema as the second parameter:
var oldSchemaDoc clob;
var newSchemaDoc clob;
begin
:oldSchemaDoc := xmltype( bfilename ( 'XML_TEST', 'CompanySpecific.1.0.xsd') ,
nls_charset_id('AL32UTF8') ).getClobVal();
:newSchemaDoc := xmltype( bfilename ( 'XML_TEST', 'CompanySpecific.1.1.xsd') ,
nls_charset_id('AL32UTF8') ).getClobVal();
end;
/
You can view the XMLDiff document by using
select xmldiff(xmltype(:oldSchemaDoc),xmltype(:newSchemaDoc)).getClobVal() from dual;
/
For simplicity, use a CLOB variable to store the
XMLDiff document.
var diffXMLDoc clob;
begin
select xmldiff(xmltype(:oldSchemaDoc),xmltype(:newSchemaDoc)).getClobVal()
into :diffXMLDoc from dual;
end;
/
Now that the XMLDiff document is available, the
inPlaceEvolve procedure can be called to modify the XML schema
online.
SQL> alter session set events = '31150 trace name context forever, level 0x200000';
Session altered.
SQL>
SQL> BEGIN
2 DBMS_XMLSCHEMA.inPlaceEvolve('CompanySpecific.1.0.xsd', xmltype(:diffXMLDoc) );
3 END;
4 /
PL/SQL procedure successfully completed.
An examination of the trace file shows that the
database type that represents the modified complex element was
altered to include the new element:
change to ct sqltype = csProduct1046_T
------------ QMTS Executing SQL ------------
ALTER TYPE "XMLTEST"."csProduct1046_T" ADD ATTRIBUTE "class" VARCHAR2(4000 CHAR)
CASCADE NOT INCLUDING TABLE DATA
/
--------------------------------------------
The introduction of
this new feature makes it possible to efficiently manage changes to
XML schemas, even with a large volume of documents, without
requiring a window of unavailability during the schema evolution.
This is a major enhancement that improves the feasibility of using
Oracle XML DB for large enterprise solutions.
Practical Applications
Oracle XML DB 11g
offers a complete, efficient content management solution that is far
more useful and practical than any of the alternatives. By taking a
database-centric view of each document, Oracle XML DB provides an
innovative and powerful method for storing and retrieving XML
content. The various storage options provide the ability to
efficiently manage all documents, regardless of size and complexity.
The new feature for efficiently modifying XML schema definitions
makes Oracle XML DB an ideal solution for frequently changing
content. These features meet the challenges of managing complex XML
content and provide a solution that deserves serious consideration
by all organizations. |