URIs and URLs

6 February 2013 - 5:05pm — Talat Chaudhri

One of the most ubiquitous identifiers on the World Wide Web is the Uniform Resource Locator (URL). It is most frequently seen as an address in the syntax of the Hypertext Transfer Protocol (HTTP), where it identifies a resource on a server so that a client can either retrieve it or else receive a response about its status and/or location. It is so familiar because it appears in the address bar of every Web browser and, by justifiable analogy, could be described as the digital counterpart of the street address. Every resource on the Web must have one, and the part of the URL that identifies a server, the domain name, is often a valuable commodity.

However, Uniform Resource Identifier (URI) has increasingly become the term of choice in the professional information community over recent years, often displacing the more commonly recognised term. For all practical purposes, URIs appear to be identical to URLs. The question must inevitably arise, then: is there any difference between the two and, if so, what is it?

In theory, the most basic identifier in the Web architecture is the URI. Each resource on the Web must have a unique URI that is uniformly employed to identify it. However, there is no requirement, in strictly theoretical terms, for it to be capable of locating that particular resource. That is to say that a URI does not theoretically have to resolve to the actual resource on a particular server. Its purpose is far simpler: purely to identify that resource. In essence, it is a unique string, or effectively a kind of name, for a resource.

The URL is a particular type of URI that, in contrast, must locate the resource. In order to do so, it must follow the syntax of a recognised scheme, typically HTTP or HTTPS but also others such as FTP, XMPP or mailto (for mail delivered over SMTP), and it must refer to a real resource in a real location on a real server. In the case of HTTP, the server should at the very least return an appropriate response, using the standard HTTP response codes, to indicate for example that the resource has been temporarily or permanently removed or that an alternative is available. Consequently, an HTTP URL looks like an address. The first part identifies the server by its domain name, which ends in a suffix that often denotes the country of origin and/or commercial or non-commercial status. There may be various locations, typically a folder tree, following this. Finally, the name of the resource, often but not necessarily a filename, completes the Web address.
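The anatomy described above can be illustrated with a minimal Python sketch that uses the standard urllib.parse module. The address and the helper name split_http_url are hypothetical, chosen purely for illustration, and the sketch only takes the string apart; it makes no attempt to contact the server.

```python
from urllib.parse import urlsplit


def split_http_url(url):
    """Break an HTTP URL into the parts described in the text.

    Hypothetical helper for illustration only: it inspects the string
    and never checks whether the address actually resolves.
    """
    parts = urlsplit(url)
    # Everything before the last "/" is the folder tree; the rest is
    # the name of the resource (often, but not necessarily, a filename).
    folders, _, resource = parts.path.rpartition("/")
    return {
        "scheme": parts.scheme,    # the protocol syntax, e.g. http
        "server": parts.hostname,  # the server and its domain name
        "folders": folders,        # locations on the server
        "resource": resource,      # the name of the resource itself
    }


# A made-up address on the reserved example.org domain.
example = split_http_url("http://www.example.org/reports/2013/identifiers.html")
for name, value in example.items():
    print(f"{name}: {value}")
```

Splitting the path at its final slash is a simplification: many Web addresses end in a folder or carry a query string, so the "resource" part may be empty or incomplete for real-world URLs.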

Why then is there any confusion? Typically, URIs are used as identifiers when technologies such as XML and RDF are employed, for example in defining attributes and relationships in linked data, which enables Semantic Web services and, where real-world, physical objects are described, the Web of Things. In this context it is useful for the URI to lead either to the actual resource or else, if this is not possible for some reason such as it being subscription-only or being a physical rather than digital object, to a document containing a description of the resource instead. A URI that does not resolve is not especially useful here, so the identifier that is chosen usually coincides with the URL for that resource whilst also functioning as a URI or canonical unique identifier for it. This is entirely in keeping with a URL being a sub-class or type of URI.

Ironically, this leads to a curious situation. If an HTTP URL fails to resolve, it is considered a broken link that ought to be rectified, yet there are no grounds to complain if a URI that merely happens to be expressed using the HTTP syntax within an RDF or XML document does not resolve. Nevertheless, it is good practice to ensure that these URIs do not mislead: an identifier whose syntax implies the location of a resource should really point to one, even if that is not a strict technical requirement in that particular context.

There is another sub-class of URI, which is used in Web services, although it is not typically as visible as a URL because it is not exposed in Web browsers in the same high-profile way. The Uniform Resource Name (URN) has one similarity to the URL in the sense that it provides a more specific function than the wider super-class of URI. However, unlike the URL, it does not locate a resource. Instead, it provides a name within a particular naming scheme, identified by a registered namespace. For example, the International Standard Book Number (ISBN) can be expressed as a URN, e.g. urn:isbn:0-395-36341-1. The identifier cannot be free-form and must follow the syntax of the particular URN scheme. It is useful because it provides a reliable, uniform name for a resource that can be used to search elsewhere, either manually or programmatically by machine. In practice, it is especially useful for physical objects because it provides a means to discover multiple copies of the same object, such as a book held in more than one location, and to compare their condition or other attributes such as price, availability and so on.
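To show what "must follow the syntax of the particular URN scheme" can mean in practice, here is a minimal Python sketch that splits the example URN into its namespace and name, and then applies the standard ISBN-10 check-digit rule to the name. The function names parse_urn and is_valid_isbn10 are hypothetical, and the sketch covers only the ten-digit form of the ISBN.

```python
import re


def parse_urn(urn):
    """Split a URN into (namespace, name), e.g. ("isbn", "0-395-36341-1").

    Hypothetical helper: it checks only the overall urn:<namespace>:<name>
    shape, not the detailed rules of any particular namespace.
    """
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a URN: {urn!r}")
    return parts[1].lower(), parts[2]


def is_valid_isbn10(isbn):
    """Apply the ISBN-10 check-digit rule, ignoring hyphens."""
    chars = isbn.replace("-", "").upper()
    # Nine digits followed by a check character, which may be X (= 10).
    if not re.fullmatch(r"[0-9]{9}[0-9X]", chars):
        return False
    values = [10 if c == "X" else int(c) for c in chars]
    # Weights run from 10 down to 1; the weighted sum must divide by 11.
    total = sum(weight * value for weight, value in zip(range(10, 0, -1), values))
    return total % 11 == 0


namespace, name = parse_urn("urn:isbn:0-395-36341-1")
print(namespace, name, is_valid_isbn10(name))
```

A name that passes this check is well formed, but that says nothing about where a copy of the book can be found, which is exactly the distinction between naming and locating drawn above.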

The URL may appear superficially to be more useful than the URN but, where both are used, they can be complementary. It is useful for a Web service to be able to identify resources reliably by URN and then locate them by URL. As URLs can and do change, it can be helpful that a URN should not. The usefulness of a URN depends on the reliability of its ultimate provider in providing accurate information. This leads to the situation where only URNs provided by large, authoritative public or commercial organisations, e.g. publishers and national libraries, are widely used. Some widely used identifier schemes such as the Digital Object Identifier (DOI) operate in a similar way to URNs. In the case of the DOI, a deliberate decision was taken not to register a URN namespace. The DOI may be expressed as a delimited string, with a prefix and suffix separated by a slash, or as an HTTP URI that is expected to resolve as a URL as well. There is thus a certain practical overlap between the types of Web identifier, even though their functions are different.
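The two ways of writing a DOI can be related with a short Python sketch. Two things here are assumptions rather than anything stated in this article: the resolver address https://doi.org/, and the DOI 10.1234/example-item, which is invented for illustration and is not expected to resolve. The function name doi_to_http_uri is likewise hypothetical.

```python
from urllib.parse import quote

# Assumed resolver address; the article itself does not name one.
DOI_RESOLVER = "https://doi.org/"


def doi_to_http_uri(doi):
    """Turn a bare DOI string into its HTTP URI form.

    Hypothetical helper: it checks only that the string looks like
    "10.<registrant>/<suffix>" and does not contact the resolver.
    """
    prefix, slash, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and slash and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode any characters that are unsafe in a URL path,
    # leaving the slashes between prefix and suffix intact.
    return DOI_RESOLVER + quote(doi, safe="/")


# An invented DOI, used purely to show the shape of the result.
print(doi_to_http_uri("10.1234/example-item"))
```

The bare string plays the part of the name and the HTTP form plays the part of the locator, which is the practical overlap that the paragraph above describes.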

Originally, there was intended to be a further type of identifier, the Uniform Resource Characteristic (URC). This was intended to carry metadata about the URL and/or URN, but was never developed. Instead, such metadata is often expressed in XML and/or RDF, contained in documents on the Web that point to the resources they describe. Dublin Core is perhaps the most widely used schema because it is the simplest, but there are many competing metadata schemas.
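As a rough illustration of metadata carried in XML, the following Python sketch builds a very small Dublin Core description with the standard xml.etree.ElementTree module. The namespace URI is that of the Dublin Core element set; the wrapping record element, the function name describe_resource and the title and creator values are all hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set. It is itself an
# HTTP-syntax URI that is used here purely as an identifier.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)


def describe_resource(identifier, title, creator):
    """Build a minimal Dublin Core description of one resource.

    Hypothetical helper: the wrapping <record> element is an
    illustrative choice, not part of Dublin Core itself.
    """
    record = ET.Element("record")
    fields = (("identifier", identifier), ("title", title), ("creator", creator))
    for name, value in fields:
        # Each element name is qualified by the namespace URI.
        ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


# Placeholder title and creator; only the URN comes from the text above.
record = describe_resource("urn:isbn:0-395-36341-1", "An example title", "An example author")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Here the description points at its resource by URN, but the identifier field could equally hold a URL or another HTTP-syntax URI.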
