One important aspect of the web is linking to external pages, and nothing is more frustrating than one of those links breaking. Webmasters are always rearranging pages and changing the technology behind their websites. Obviously in the semantic web, which depends entirely on the concept of links, keeping URLs valid is one of our primary concerns. One way you can help keep links to your own sites valid is to make sure that the URL contains no redundant or non-semantic information. For example, the URL
http://example.org/cgi-bin/photos/2004-08-09.php?type=cat has several things wrong with it. One is the cgi-bin prefix, which serves no purpose. No visitor cares that they are visiting the dynamic part of your site. They just want to see a web page. Another problem is that it lets the underlying technology shine through, technology that might change at a moment’s notice: the .php file extension reveals that the page is written in PHP, and ?type=cat exposes the argument passed to the script.
One simple method of hiding the file extensions is to enable the MultiViews option in your Apache .htaccess file. In your top-level web directory, create the file .htaccess (yes, with the period at the front) or, if it already exists, just edit it. Add the line Options +MultiViews. This tells Apache that when /cgi-bin/photos/2004-08-09 is requested, it should look for files with that base name and serve the one whose extension best fits what the browser says it can display, based on MIME type. You could also use this technique to offer different versions of the same thing. Say you have an SVG image, but you know that most browsers don’t support SVG. You could offer the SVG file pic.svg along with a JPEG file pic.jpg and link to the filename pic without the extension. Apache will negotiate content with the browser and pick the right one.
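As a rough sketch of the setup just described, the .htaccess file might contain something like the following. It assumes that mod_negotiation is loaded and that the main server configuration lets .htaccess files change these settings (AllowOverride Options FileInfo). The AddType line is an assumption, only needed if your server does not already know the SVG MIME type.

```apache
# .htaccess in the top-level web directory

# Let Apache choose among files that share a base name (pic.svg, pic.jpg)
# when a URL is requested without its extension (/pic).
Options +MultiViews

# Assumption: teach Apache the SVG MIME type in case it is not already configured.
AddType image/svg+xml .svg
```

With pic.svg and pic.jpg sitting side by side, a link to pic lets the browser’s Accept header decide which file is sent.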
A more complete solution for permanent links is the mod_rewrite module for Apache. It lets you transform incoming URLs into anything you like. It is both very powerful and very complicated. Take our example: you could transform the incoming URL /photos/cats/2004-08-09, which is very succinct, into /cgi-bin/photos/2004-08-09.php?type=cat, all transparently to the user. Best of all, you can rearrange the underlying filesystem in any way you like and still map the old link onto the new location. Thus, nothing stops your links from always being valid.
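A minimal sketch of that rewrite, placed in the top-level .htaccess, might look like this. It assumes mod_rewrite is loaded, that .htaccess files are allowed to use it (AllowOverride FileInfo), and that FollowSymLinks is enabled for the directory. The pattern is just one illustrative way of matching the date and would need adjusting for other kinds of photos.

```apache
RewriteEngine On

# /photos/cats/2004-08-09  ->  /cgi-bin/photos/2004-08-09.php?type=cat
# In .htaccess the pattern is matched without the leading slash.
RewriteRule ^photos/cats/([0-9]{4}-[0-9]{2}-[0-9]{2})$ /cgi-bin/photos/$1.php?type=cat [L]
```

Because this is an internal rewrite (there is no R flag), the visitor’s address bar keeps showing the short URL.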
For more information, see Tim Berners-Lee’s excellent article on permanent links.
XHTML has been the successor to HTML since 2000, but it hasn’t seen much uptake because most of its advantage lies in its use of XML as the transport format. Using XML would seem to be an advantage if you know much about it, but in the wacky world wide web, nothing is quite so simple. XHTML allows you to send pages as text/html if you like, but it really prefers application/xhtml+xml, application/xml, or text/xml, in that order. The problem is twofold. First, some web developers balk at the pickiness of XML. If a document is not well-formed (that is, it contains syntax errors like <p<b>), user agents are supposed to refuse to render it. Second, IE fails to recognize the application/xhtml+xml MIME type, instead offering to download the page; if you try to give it application/xml, it doesn’t understand that it’s looking at XHTML and doesn’t do any styling.
Naturally, this situation has dampened developer enthusiasm for XHTML. But there are real advantages here: a cleaned-up HTML with few presentation details inside a robust, internationalizable markup, and XML plays well with other XML formats, like MathML or SVG. XHTML is the way of the 21st century, and it’s time we used it.
The first problem is just web developers being stubborn. If your document is not well-formed, how can you expect user agents to properly understand it? You can’t always rely on what current browsers happen to do when given malformed markup.
The second problem we can actually do something about. We could do some sort of browser sniffing and serve the document as XML to browsers that can handle it and as text/html to those that can’t, but browser sniffing is notoriously unreliable and not all browsers declare their capability to read XML. We would also have to make sure that we never took advantage of XML, because we would have to remain backward-compatible for those reading the HTML version, and the differences are large enough to cause problems when XHTML is sent as text/html. Maintaining two versions of the document is too large a headache to be worth it. Thankfully, the only modern browser holding out is IE, and, surprisingly, we can solve that problem. The W3C itself has discovered a workaround for IE, letting it display XML as HTML. First, you are going to want to make the top of your XHTML file look like this:
<?xml version="1.0" encoding="utf-8"?>
<!-- To trick Microsoft Internet Explorer -->
<?xml-stylesheet href="copy.xsl" type="text/xsl"?>
<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xml:lang="en-us" xmlns="http://www.w3.org/1999/xhtml">
Most of that is standard fare required by the W3C for XHTML files. However, you’ll notice the XML stylesheet linking to copy.xsl, a file that will look like this:
<stylesheet version="1.0"
    xmlns="http://www.w3.org/1999/XSL/Transform">
  <template match="/">
    <copy-of select="."/>
  </template>
</stylesheet>
This is an XSL transformation that essentially makes a copy of the current page. No change for intelligent browsers, but for some reason IE thinks you are translating from XML to HTML and displays it as such.
The last step is to serve your pages as application/xml. IE will still not touch application/xhtml+xml, so we’ll have to settle for the next best thing. You can do this by changing the file extension to .xml (which shouldn’t mess up your links if you’ve been paying attention) or configuring your web server to serve it as application/xml for you.
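If you go the server-configuration route with Apache, one possible way is an AddType line in .htaccess. This sketch assumes your XHTML files use the .xhtml extension (a hypothetical choice; substitute your own) and that AllowOverride FileInfo is in effect.

```apache
# Serve XHTML files as application/xml so that IE applies the XSLT trick.
# Assumption: the files are named *.xhtml.
AddType application/xml .xhtml
```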
This works on Windows IE 6.0 and should work on Windows IE 5.0/5.5. It does not work on Mac IE 5. I am interested in reports of other versions of IE for which it does or does not work. Note that Google does understand application/xml, so your ranking shouldn't suffer.
And, of course, don’t forget to always validate!
The semantic web’s big deal is metadata: inter-linking sources of data and metadata all joined at the hip. A widely adopted format for metadata is RDF, which we will use to talk about our website and ourselves. You’ll want to create a file at the top-level of your website called metadata.rdf or whatever. It will basically look like this:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xml:lang="en-us"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
</rdf:RDF>
Inside of the rdf:RDF tags, you will insert metadata of various kinds. I suggest you read up on common kinds, like FOAF or Dublin Core. Here is some sample metadata for Jane Doe and her page about airplanes:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xml:lang="en-us"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <foaf:Person rdf:nodeID="janedoe">
    <foaf:name>Jane Doe</foaf:name>
    <!-- foaf:mbox takes a mailto: URI, not a bare address -->
    <foaf:mbox rdf:resource="mailto:janedoe@example.org"/>
  </foaf:Person>
  <rdf:Description rdf:about="http://www.example.org/airplanes">
    <dc:creator>Jane Doe</dc:creator>
    <dc:title>Commercial Airplanes of the 1960s</dc:title>
    <dc:date>2004-02-19</dc:date>
    <foaf:maker rdf:nodeID="janedoe"/>
  </rdf:Description>
</rdf:RDF>
There are all kinds of RDF vocabularies, covering a wide range of topics. I recommend you explore your options, because there’s a lot more you can do than the simple example above. You can stick as much metadata in the file as you like, about any number of people or pages. You are welcome to take a look at my metadata file as an example.
In order to let people know about your metadata, you need to add a tag to your XHTML pages to point to metadata.rdf. In your head section, add the line:
<link rel="meta" type="application/rdf+xml" href="metadata.rdf"/>
Now user agents reading your website will see a possible source of metadata if they want. Most will not use it, but some user agents can. You can also register your file with some of the FOAF repositories floating around for wider distribution.
Bam! You are now a part of the semantic web.