-
In the previous video, we learned the basics of XML.
-
In this video, we're
-
going to learn about Document Type Definitions,
-
also known as DTDs, and also ID and IDREF attributes.
-
We learned that well-formed XML
-
is XML that adheres to
-
basic structural requirements: a single
-
root element, matched tags with
-
proper nesting, and unique
-
attributes within each element.
-
Now we're going to learn about what's known as valid XML.
-
Valid XML has to adhere
-
to the same basic structural requirements
-
as well-formed XML, but it
-
also adheres to content-specific specifications.
-
And we're going to learn two languages for those specifications.
-
One of them is Document Type
-
Definitions, or DTDs, and the
-
other, a more powerful language, is XML Schema.
-
Specifications in XML
-
Schema are known as XSDs, for XML Schema Definitions.
-
So as a reminder, here's how
-
things worked with well-formed XML documents.
-
We sent the document to a
-
parser and the parser would
-
either return that the document
-
was not well-formed or it would return parsed XML.
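As a rough illustration of that first step, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. That parser checks well-formedness only and does not validate against a DTD. The sample documents and the helper name parse_if_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_if_well_formed(text):
    """Return the parsed root element, or None if the text is not well-formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

# Hypothetical sample documents (element names are assumptions).
good = "<Bookstore><Book><Title>A First Course in Database Systems</Title></Book></Bookstore>"
bad = "<Bookstore><Book><Title>Unclosed</Book></Bookstore>"  # Title is never closed

print(parse_if_well_formed(good) is not None)
print(parse_if_well_formed(bad) is not None)
```

The first call yields parsed XML; the second reports that the document is not well-formed.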
-
Now let's consider what happens with valid XML.
-
Now we use a validating
-
XML parser, and we have
-
an additional input to the
-
process, which is a
-
specification, either a DTD or an XSD.
-
So that's also fed to the parser, along with the document.
-
The parser can again
-
say the document is
-
not well-formed if it doesn't meet the basic structural requirements.
-
It could also say that the
-
document is not valid, meaning
-
the structure of the document doesn't
-
match the content-specific specification.
-
If everything is good, then
-
once again "parsed XML" is returned.
-
Now let's talk about Document Type Definitions, or DTDs.
-
We see a DTD in
-
the lower-left corner of the
-
video, but we won't look
-
at it in any detail, because we'll
-
be doing demos of DTDs a little later on.
-
A DTD is a language
-
that's kind of like a grammar, and
-
what you can specify in that language, for
-
a particular document, is what elements
-
you want that document to contain,
-
the tags of the elements,
-
what attributes can be in
-
the elements, and how the different types of elements can be nested.
-
Sometimes you also want to specify
-
the ordering of the elements,
-
and sometimes the number of occurrences of different elements.
-
DTDs also allow the
-
introduction of special types of
-
attributes, called IDs and IDREFs.
-
And, effectively, what these allow you
-
to do is specify pointers within
-
a document, although these pointers are untyped.
-
Before moving to the demo,
-
let's talk a little bit about
-
the positives and negatives of
-
choosing to use a DTD
-
or an XSD for one's XML data.
-
After all, if you're
-
building an application that encodes
-
its data in XML, you'll have
-
to decide whether you want the
-
XML to just be well-formed
-
or whether you want to
-
have specifications and require the
-
XML to be valid to satisfy those specifications.
-
So, let's list a few positives
-
of choosing to require a DTD or an XSD.
-
First of all, one of
-
them is that when you write your
-
program, you can assume
-
that the data adheres to a specific structure.
-
So programs can assume a
-
structure and so the
-
programs themselves are simpler because they don't
-
have to be doing a lot of error checking on the data.
-
They'll know that before the data
-
reaches the program, it's been
-
run through a validator and it does satisfy a particular structure.
-
Second, we talked
-
some time ago about
-
the Cascading Style Sheets language
-
and the Extensible Stylesheet Language.
-
These are languages that take XML
-
and they run rules on it
-
to process it into a different form, often HTML.
-
When you write those rules, if
-
you know that the data
-
has a certain structure, then those
-
rules can be simpler. So, like
-
the programs, they can also
-
assume a particular structure, and that makes them simpler.
-
Now, another use for DTDs
-
or XSDs is as a
-
specification language for conveying
-
what XML might need to look like.
-
So, as an example if you're
-
performing data exchange using
-
XML, maybe a company is
-
going to receive purchase orders in
-
XML, the company can
-
actually use the DTD as
-
a specification for what
-
the XML needs to look
-
like when it arrives at
-
the program that's going to operate on it.
-
Also, for documentation, it can
-
be useful to use one of
-
the specifications to just document
-
what the data itself looks like.
-
In general, really what
-
we have here is the benefits of typing.
-
We're talking about strongly typed data
-
versus loosely typed data, if you want to think of it that way.
-
Now let's look at when we might prefer not to use a DTD.
-
So what I'm going to describe down
-
here are the benefits of not using a DTD.
-
So the biggest benefit is flexibility.
-
So a DTD makes your
-
XML data have to conform to a specification.
-
If you want more flexibility or
-
you want ease of change
-
in the way that the data is
-
formatted without running into
-
a lot of errors,
-
then a DTD can be constraining.
-
Another fact is that DTDs can
-
be fairly messy. This
-
is not going to be obvious
-
to you until we get
-
into the demo, but if
-
the data is very irregular, then
-
specifying its structure can
-
be hard.
-
Actually, when we see
-
the schema language, we'll
-
discover that XSDs can be,
-
I would say, really messy, so they can actually get very large.
-
It's possible to have a
-
document where the specification of
-
the structure of the document is
-
much, much larger than the
-
document itself, which seems not
-
entirely intuitive, but when we get to
-
learn about XSDs, I think you'll see how that can happen.
-
So, overall, these are
-
the benefits of no typing.
-
It's really quite similar to
-
the situation in programming languages.
-
The remainder of this video will
-
teach about the DTDs themselves through a set of examples.
-
We'll have a separate video
-
for learning about XML schema and XSDs.
-
So, here we are
-
with our first document that we're
-
going to look at with a Document Type Definition.
-
We have on the left the document itself.
-
We have on the right the Document Type
-
Definition, and then we have
-
in the lower right a command
-
line shell that we're going to use to validate the document.
-
So this is similar data to
-
what we saw in the last video,
-
but let's go through it just to see what we have.
-
We have an outermost element called
-
bookstore, and we have two books in our bookstore.
-
The first book has an ISBN number, price, and edition
-
as attributes, and then it
-
has a sub-element called title, and another
-
sub-element called authors with two
-
authors underneath, each with a first name and a last name.
-
The second book element is
-
similar, except it doesn't have an edition.
-
It also has, as we see, a remark.
-
Now let's take a look at
-
the DTD. I'm just going
-
to walk through the DTD, not
-
too slowly, not too fast, and
-
explain exactly what it's doing.
-
So the start of the
-
DTD says this is a
-
DTD named bookstore and the
-
root element is called bookstore,
-
and now we have the first grammar-like construct.
-
So these constructs, in fact, are
-
a little bit like regular expressions if you know them.
-
What this says is that
-
a bookstore element has as
-
its sub-elements any number
-
of elements that are called book or magazine.
-
We have book or magazine.
-
We don't have any magazines yet, but we'll add one.
-
And then this star says zero or more instances.
-
It's the Kleene closure operator, for those of you familiar with regular expressions.
-
Now let's talk about
-
what the book element
has, so that's our next specification.
-
The book element has a
-
title followed by authors,
-
followed by an optional remark.
-
So now we don't have an
-
"or", we have a comma, and
-
that says that these are going to
-
be in that order - title,
-
authors, and remark and the
-
question mark says that the remark is optional.
-
Next we have the attributes of our book elements.
-
So this "bang attribute list" (!ATTLIST)
-
says we're going to describe
-
the attributes and we're going
-
to have three of them: the ISBN,
-
the price, and the edition.
-
CDATA is the type of the attribute.
-
It's just a string.
-
And then #REQUIRED says that
-
the attribute must be present, whereas
-
#IMPLIED says it doesn't have to be present.
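To make the required-versus-implied distinction concrete, here is a small Python sketch of the check a validator performs for attributes. It is not a validator; the dictionary is a hand-written stand-in for the attribute list, and the names BOOK_ATTRIBUTES and missing_required, the capitalized attribute names, and the sample values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the book attribute list described above.
BOOK_ATTRIBUTES = {"ISBN": "#REQUIRED", "Price": "#REQUIRED", "Edition": "#IMPLIED"}

def missing_required(element, declared):
    """List declared #REQUIRED attributes that the element does not carry."""
    return [name for name, default in declared.items()
            if default == "#REQUIRED" and name not in element.attrib]

# A book without an edition, like the second book in the demo (values are placeholders).
second_book = ET.fromstring('<Book ISBN="ISBN-2" Price="100"/>')

print(missing_required(second_book, BOOK_ATTRIBUTES))   # nothing missing

# If Edition were declared #REQUIRED instead, the same book would be flagged.
stricter = dict(BOOK_ATTRIBUTES, Edition="#REQUIRED")
print(missing_required(second_book, stricter))
```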
-
As you may remember, we have one book that doesn't have an edition.
-
Our magazines are simply going
-
to have titles and they're going
-
to have attributes that are month and year.
-
Again, we don't have any magazines yet.
-
A title is going to
-
consist of string data.
-
So here we see our title, A First Course in Database Systems.
-
You can think of that as the leaf data in the XML tree.
-
And when you have a leaf that
-
consists of text data, this is
-
what you put in the DTD
-
- just take my word for it:
-
#PCDATA ("hash PC data") in parentheses.
-
Now our authors element still has structure.
-
Our authors element has
-
author sub-elements,
-
and we're going to
-
specify here that the
-
authors element must have one
-
or more author sub-elements.
-
So that's what the plus
-
is saying here, again taken from regular expressions.
-
"Plus" means one or more instances.
-
We have the remark, which
-
is just going to be #PCDATA, or string data.
-
We have our author elements, which consist
-
of a first-name sub-element and
-
a last-name sub-element, in that order.
-
And then finally, our first names and last names are also strings.
-
So, this is the entire
-
DTD and it describes
-
in detail the structure
-
of our document.
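Since the content models read much like regular expressions, one way to build intuition is to translate them by hand into Python regular expressions over the sequence of child tag names. This is only a sketch of the analogy, not how a real validator works: it ignores text content and attributes, and the tag names (Bookstore, First_Name, and so on) and the helper names are assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Each content model rewritten by hand as a regex over child tag names,
# where every child name is followed by a comma.
CONTENT_MODELS = {
    "Bookstore": r"((Book|Magazine),)*",   # (Book | Magazine)*
    "Book": r"Title,Authors,(Remark,)?",   # (Title, Authors, Remark?)
    "Authors": r"(Author,)+",              # (Author+)
    "Author": r"First_Name,Last_Name,",    # (First_Name, Last_Name)
}

def children_match(element):
    """Check one element's child sequence against its hand-translated model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return True  # text-only leaves are not checked in this sketch
    child_string = "".join(child.tag + "," for child in element)
    return re.fullmatch(model, child_string) is not None

def violations(root):
    """Tags of all elements whose children are in the wrong shape or order."""
    return [e.tag for e in root.iter() if not children_match(e)]

GOOD = """<Bookstore>
  <Book>
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><First_Name>Jeffrey</First_Name><Last_Name>Ullman</Last_Name></Author>
      <Author><First_Name>Jennifer</First_Name><Last_Name>Widom</Last_Name></Author>
    </Authors>
  </Book>
</Bookstore>"""

# Swap first and last name for one author to break the required order.
SWAPPED = GOOD.replace(
    "<First_Name>Jeffrey</First_Name><Last_Name>Ullman</Last_Name>",
    "<Last_Name>Ullman</Last_Name><First_Name>Jeffrey</First_Name>",
)

print(violations(ET.fromstring(GOOD)))
print(violations(ET.fromstring(SWAPPED)))
```

The star, question mark, and plus carry over unchanged; the comma becomes plain concatenation and the "or" becomes the regex alternation bar.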
-
Now we have a command, we're
-
using something called xmllint,
-
that will check to see if the document meets the structure.
-
We'll just run that command
-
here with a couple of options, and
-
it doesn't give us any output
-
which actually means that our document is correct.
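For readers who want to try this, here is a hedged Python sketch that writes a document with an inline DTD to a temporary file and runs xmllint on it. The video doesn't say which options were used; --valid and --noout are one common combination (validate against the DTD, print nothing on success). The DTD below is reconstructed from the narration, so the exact tag names, attribute names, and sample values are assumptions, and the helper validate_with_xmllint is hypothetical.

```python
import os
import shutil
import subprocess
import tempfile

# Document with the DTD placed inline; reconstructed from the narration.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE Bookstore [
  <!ELEMENT Bookstore (Book | Magazine)*>
  <!ELEMENT Book (Title, Authors, Remark?)>
  <!ATTLIST Book ISBN CDATA #REQUIRED
                 Price CDATA #REQUIRED
                 Edition CDATA #IMPLIED>
  <!ELEMENT Magazine (Title)>
  <!ATTLIST Magazine Month CDATA #REQUIRED
                     Year CDATA #REQUIRED>
  <!ELEMENT Title (#PCDATA)>
  <!ELEMENT Authors (Author+)>
  <!ELEMENT Remark (#PCDATA)>
  <!ELEMENT Author (First_Name, Last_Name)>
  <!ELEMENT First_Name (#PCDATA)>
  <!ELEMENT Last_Name (#PCDATA)>
]>
<Bookstore>
  <Book ISBN="ISBN-1" Price="85" Edition="3rd">
    <Title>A First Course in Database Systems</Title>
    <Authors>
      <Author><First_Name>Jeffrey</First_Name><Last_Name>Ullman</Last_Name></Author>
      <Author><First_Name>Jennifer</First_Name><Last_Name>Widom</Last_Name></Author>
    </Authors>
  </Book>
</Bookstore>
"""

def validate_with_xmllint(xml_text):
    """Return True if valid, False if not, or None if xmllint is not installed."""
    if shutil.which("xmllint") is None:
        return None
    with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False,
                                     encoding="utf-8") as handle:
        handle.write(xml_text)
        path = handle.name
    try:
        result = subprocess.run(["xmllint", "--valid", "--noout", path],
                                capture_output=True, text=True)
        return result.returncode == 0  # a zero exit status means no errors
    finally:
        os.unlink(path)

print(validate_with_xmllint(DOCUMENT))
```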
-
We'll be making some edits and seeing what happens when we run the command on a document that is not correct.
-
So let's make our first edit,
-
let's say that we decide that
-
we want the edition attribute
-
of our books to be "required" rather than "implied".
-
So we'll change the DTD.
-
We'll save the file, and now we run our command.
-
So as expected we got an
-
error, and the error said
-
that one of our book elements does not have the attribute edition.
-
Now that edition is required, every book element ought to have it.
-
So let's add an edition to our second book.
-
Let's say that it's
-
the second edition, save the
-
file, we'll validate our
-
document again, and now everything is good. Let's
-
do an edit to the document
-
this time to see what
-
happens when we change the
-
order of the first name and the last name.
-
So we've swapped Jeffrey Ullman to be Ullman Jeffrey.
-
We validate our document, and now
-
we see we got an error
-
because the elements are not in the correct order.
-
In this case, let's undo that
-
change, rather than change our DTD.
-
Let's try another edit to our document.
-
Let's add a remark to our first book.
-
But what we'll do is
-
we'll leave the remark empty, so
-
we'll add an opening tag and then
-
directly a closing tag, and let's see if that validates.
-
So, it did validate.
-
And in fact when we have
-
#PCDATA as the type
-
of an element, it's perfectly acceptable to have an empty element.
-
As a final change, let's add a magazine to our database.
-
You'll have to bear with me as I type.
-
I'm always a little bit slow.
-
So we see over here that
-
when we have a magazine there are
-
two required attributes, the month and the year.
-
So, let's say the month is
-
January and the year,
-
let's make that 2011,
-
and then we have a title for our magazine.
-
Here.
-
We'll go down here.
-
Our title, let's make it National Geographic.
-
We'll close the title tag.
-
And then, sorry again about my typing.
-
Let's go ahead and validate the document.
-
We saw "premature end" of something or other.
-
We forgot our closing tag for
-
magazine, let's put that in.
-
My terrible typing, and here we go.
-
Let's validate, and we're done.
-
Now we're going to learn about ID and IDREF attributes.
-
The document on the left side
-
contains the same data as
-
our previous document but completely restructured.
-
Instead of having authors as
-
subelements of book elements,
-
we're going to have our authors listed separately,
-
and then effectively point from the books to the authors of the book.
-
We'll take a look at the
-
data first, and then
-
we'll look at the DTD that describes the data.
-
Let's actually start with the
-
author, so our bookstore element
-
here has two subelements that are books and three that are authors.
-
So, looking at the authors, we have
-
the first name and last name
-
as sub-elements as usual, but
-
we've added what we call the ident attribute.
-
That's not a keyword; we've just
-
called the attribute ident, and
-
then for each of the three authors,
-
we've given a string value
-
to that attribute that we're going
-
to use effectively for the pointers in the book.
-
So we have our three authors, now let's take a look at the books.
-
Our book has the ISBN number and price.
-
I've taken the edition out for now.
-
And then we have a
-
special attribute called authors.
-
Authors is an IDREFS
-
attribute, and its value
-
can refer to one or
-
more strings that are ID
-
attributes in other elements.
-
So that's what we're doing here.
-
We're referring to the two author elements here.
-
And in our second book we're referring to the three author elements.
-
We still have the title subelement
-
and we still have the remark sub-element.
-
And furthermore, we have one
-
other cute thing here, which is,
-
instead of referring to
-
the book by name within the
-
remark when we're talking about
-
the other book, we have another type of pointer.
-
So we'll specify that the
-
ISBN is an ID
-
for books and then this
-
is an IDREF attribute
-
that's referring to the id of the other book.
-
The DTD on the right describes the structure of this document.
-
This time our bookstore is
-
going to contain zero or more
-
books followed by zero or more authors.
-
Our books contain a title and
-
an optional remark as sub-elements, and
-
now they contain three attributes:
-
the ISBN, which is
-
now a special type of
-
attribute called an ID, the
-
price, which is a string
-
value as usual, and the
-
authors, which is the special type
-
called IDREFS. Let's keep
-
going. Our title is just a string value as usual.
-
The remark here is actually an interesting construct.
-
A remark consists of
-
#PCDATA, which is string data,
-
or a book reference, and then
-
zero or more instances of those.
-
This is the type of construct
-
that can be used to mix
-
strings and sub elements within an element.
-
So anytime you want an
-
element that might have some
-
strings and then another element and then more string values.
-
That's how it's done:
-
#PCDATA or the element type, zero or more times.
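To see what mixed content looks like once parsed, here is a short Python sketch with the standard library's ElementTree, which stores the text before the first child in .text and the text after each child in that child's .tail. The remark wording, the tag name BookRef, and the helper mixed_pieces are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A remark that would fit a declaration like (#PCDATA | BookRef)*:
# text, then an empty sub-element, then more text.
remark = ET.fromstring(
    '<Remark>See also <BookRef book="ISBN-1"/> for an introduction.</Remark>'
)

def mixed_pieces(element):
    """Return the text runs and child elements in document order."""
    pieces = []
    if element.text:
        pieces.append(("text", element.text))
    for child in element:
        pieces.append(("element", child.tag))
        if child.tail:
            pieces.append(("text", child.tail))
    return pieces

for kind, value in mixed_pieces(remark):
    print(kind, repr(value))
```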
-
Then we have our book reference,
-
which is actually an empty element. It's
-
only interesting because it has
-
an attribute. So let's go
-
back here. We see our book
-
reference here. It actually doesn't
-
have any data or sub-elements,
-
but it has an
-
attribute called book, and that is an IDREF.
-
That means it refers to an
-
ID attribute of another
-
element.
-
Now we have our authors, with the first
-
name and the last name, and
-
our author elements again have
-
an ID attribute, and we're calling it ident.
-
And finally the first name and last name are string values.
-
This may seem overwhelming but the
-
key points in this DTD
-
are the ID attributes.
-
So the ID attributes, the ISBN
-
attribute in the book and
-
the ident
-
attribute in the author,
-
are special attributes, and by
-
the way, the values of those attributes
-
do need to be unique,
-
and they're special in that
-
IDREFS attributes can refer
-
to them, and that will be checked as well.
-
Now, I did want to
-
point out that the book
-
reference here says IDREF, singular.
-
When you have a singular
-
IDREF, then the string has
-
to be exactly one ID value.
-
When you have the plural IDREFS,
-
then the string of the
-
attribute is one or
-
more ID values separated by spaces.
-
So it's a little bit clunky, but it does seem to work.
-
Now let's go to our command line, and let's validate the document.
-
So the document is in fact valid.
-
That's what it means when we
-
get nothing back, and let's
-
make some changes, as we did
-
before, to explore what structure
-
is imposed and what's checked with this DTD in the presence
-
of IDs and IDREFs.
-
As a first change, let's change
-
this ID, this identifier
-
HG, to JU.
-
That should actually cause a couple of problems
-
when we do that. Let's
-
validate the document and see what happens.
-
And we do in fact get two different errors.
-
The first error says that
-
we have two instances of "JU".
-
As you can see here, we
-
now have JU twice, whereas
-
ID values do have to be unique.
-
They have to be globally unique throughout the document.
-
The second error that occurred
-
when we changed HG to JU
-
is we effectively have a dangling pointer.
-
We refer to HG here
-
in this IDREFS attribute, but there's
-
no longer an element whose ID value is HG.
-
So that's an error as well.
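Both checks, uniqueness of ID values and no dangling references, can be sketched in a few lines of Python. This is not what xmllint does internally; it is a hand-rolled illustration in which the sets below stand in for the attribute declarations, and the element and attribute names (Ident, Authors, BookRef, book) and sample values are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written stand-ins for the attribute declarations: (element, attribute).
ID_ATTRS = {("Book", "ISBN"), ("Author", "Ident")}
IDREF_ATTRS = {("BookRef", "book")}
IDREFS_ATTRS = {("Book", "Authors")}

def check_ids(root):
    """Return (duplicate ID values, referenced values with no matching ID)."""
    ids = Counter()
    refs = []
    for element in root.iter():
        for name, value in element.attrib.items():
            key = (element.tag, name)
            if key in ID_ATTRS:
                ids[value] += 1
            elif key in IDREF_ATTRS:
                refs.append(value)           # exactly one ID value
            elif key in IDREFS_ATTRS:
                refs.extend(value.split())   # ID values separated by spaces
    duplicates = sorted(v for v, n in ids.items() if n > 1)
    dangling = sorted({r for r in refs if r not in ids})
    return duplicates, dangling

DOC = """<Bookstore>
  <Book ISBN="ISBN-1" Price="85" Authors="JU JW">
    <Title>A First Course in Database Systems</Title>
  </Book>
  <Book ISBN="ISBN-2" Price="100" Authors="HG JU JW">
    <Title>Database Systems: The Complete Book</Title>
    <Remark>See also <BookRef book="ISBN-1"/></Remark>
  </Book>
  <Author Ident="HG"/>
  <Author Ident="JU"/>
  <Author Ident="JW"/>
</Bookstore>"""

print(check_ids(ET.fromstring(DOC)))

# The edit from the demo: rename the identifier HG to JU.
EDITED = DOC.replace('Ident="HG"', 'Ident="JU"')
print(check_ids(ET.fromstring(EDITED)))
```

The edited document shows both problems at once: JU now appears twice as an ID, and the reference to HG no longer points at anything.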
-
So let's change it back to
-
HG just so our document is valid again.
-
Now let's make another change, let's take our book reference.
-
We can see that our book reference is referring to the other book.
-
We're in The Complete Book here,
-
and the remark is
-
referring to the First Course
-
through the ISBN number, but let's
-
change this string instead to refer to HG.
-
So now we're actually referring
-
to an author rather than another book.
-
Let's check if the document validates.
-
In fact it does.
-
And that shows that the
-
pointers when you have a DTD are untyped.
-
So it does check to make
-
sure that this is an
-
ID of another element, but we
-
weren't able to specify that
-
it should be a book element
-
in our DTD, and since we're
-
not able to specify it, of
-
course it's not possible to check it.
-
We will see that in XML
-
schema, we can have typed
-
pointers but it's not possible to have them in DTDs.
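One way to picture "untyped" is that all ID values in a document live in a single namespace, so a reference resolves to whatever element carries that ID. The Python sketch below builds such an index by hand; the attribute names listed in ID_ATTRIBUTE_NAMES, the tag names, and the helper build_id_index are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

ID_ATTRIBUTE_NAMES = ("ISBN", "Ident")  # assumed names of the ID attributes

def build_id_index(root):
    """Map every ID value in the document to the element that carries it."""
    index = {}
    for element in root.iter():
        for name in ID_ATTRIBUTE_NAMES:
            if name in element.attrib:
                index[element.attrib[name]] = element
    return index

# The book reference points at an author's ID rather than a book's ISBN.
doc = ET.fromstring("""<Bookstore>
  <Book ISBN="ISBN-1"><Title>A First Course in Database Systems</Title></Book>
  <Book ISBN="ISBN-2">
    <Title>Database Systems: The Complete Book</Title>
    <Remark>See also <BookRef book="HG"/></Remark>
  </Book>
  <Author Ident="HG"/>
</Bookstore>""")

index = build_id_index(doc)
target = index[doc.find(".//BookRef").get("book")]

print(target.tag)            # the reference resolves, but to an author
print(target.tag == "Book")  # a typed check would have to be added by hand
```

The lookup succeeds, which mirrors what the DTD can check; requiring the target to be a book is an extra test outside what the DTD can express.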
-
The last change I'm going to
-
show is to add a
-
second book reference within our remark.
-
So as I pointed out over
-
here, when we write #PCDATA
-
or an element type
-
followed by the Kleene closure, the
-
zero-or-more star, that
-
means we can freely mix text and sub-elements.
-
So right in the middle here, let's put a book reference,
-
and we can put, let's say,
-
book equals JU, and that
-
will be the end of our reference
-
there. And now we
-
see that we have text followed
-
by a sub-element followed by more
-
text, and so on.
-
That should validate fine, and in fact it does.
-
That completes our demonstration of
-
XML documents with DTDs.