URI normalization: Difference between revisions

Revision as of 09:14, 13 June 2011

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs may be equivalent.

Search engines employ URL normalization in order to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached.

Normalization process

There are several types of normalization that may be performed. Some of them are semantics preserving and some are not.

Semantics-preserving normalizations

RFC 3986 ^[1] describes the following normalizations as resulting in equivalent URLs:

Converting the scheme and host to lower case. The scheme and host components of the URL are case-insensitive. Most normalizers will convert them to lowercase. Example:

HTTP://www.Example.com/ → http://www.example.com/
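As a minimal sketch of this step, the following uses Python's standard `urllib.parse` module. The function name `lowercase_scheme_and_host` is hypothetical, and the handling of the userinfo part (left untouched because it may be case-sensitive) is an assumption not discussed above.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_scheme_and_host(url: str) -> str:
    """Lowercase the scheme and host of a URL, leaving everything else as-is."""
    parts = urlsplit(url)
    # Keep any "user:password@" prefix unchanged; only host[:port] is lowercased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


print(lowercase_scheme_and_host("HTTP://www.Example.com/"))
```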

Capitalizing letters in escape sequences. All letters within a percent-encoding triplet (e.g., "%3A") are case-insensitive, and should be capitalized. Example:

http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
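One possible way to capitalize the escape sequences is a regular-expression substitution over the whole URL string. This is a sketch only; the helper name `uppercase_escapes` is made up for illustration.

```python
import re

# A percent sign followed by exactly two hexadecimal digits.
_PERCENT_TRIPLET = re.compile(r"%[0-9a-fA-F]{2}")


def uppercase_escapes(url: str) -> str:
    """Uppercase the hex digits of every percent-encoding triplet."""
    return _PERCENT_TRIPLET.sub(lambda match: match.group(0).upper(), url)


print(uppercase_escapes("http://www.example.com/a%c2%b1b"))
```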

Decoding percent-encoded octets of unreserved characters. For consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.^[2] Example:

http://www.example.com/%7Eusername/ → http://www.example.com/~username/
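The rule above can be sketched as follows: each percent-encoded octet is decoded only when it corresponds to one of the unreserved characters listed (ALPHA, DIGIT, hyphen, period, underscore, tilde). The function name `decode_unreserved` is hypothetical.

```python
import re
import string

# The unreserved set named in the text: ALPHA, DIGIT, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(url: str) -> str:
    """Decode percent-encoded octets that stand for unreserved characters."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Leave reserved and non-ASCII octets (e.g. %2F, %C2) encoded.
        return char if char in UNRESERVED else match.group(0)

    return re.sub(r"%([0-9a-fA-F]{2})", replace, url)


print(decode_unreserved("http://www.example.com/%7Eusername/"))
```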

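Adding a trailing slash. Directories are indicated with a trailing slash, which should be included in URLs. RFC 3986 itself only treats an empty path as equivalent to "/", so adding a slash to a non-empty path is not guaranteed to preserve semantics. Example: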

http://www.example.com → http://www.example.com/

Removing the default port. The default port (port 80 for the “http” scheme) may be removed from (or added to) a URL. Example:

http://www.example.com:80/bar.html → http://www.example.com/bar.html
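A rough sketch of default-port removal is shown below. The `DEFAULT_PORTS` table and the function name are assumptions made for illustration; the text above only mentions port 80 for "http", and port 443 for "https" is added here as a well-known default.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed table of scheme defaults; extend as needed.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(url: str) -> str:
    """Drop the port from the authority when it is the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # The port is always after the last colon of the authority.
        netloc = parts.netloc.rsplit(":", 1)[0]
        return urlunsplit(parts._replace(netloc=netloc))
    return url


print(strip_default_port("http://www.example.com:80/bar.html"))
```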

Removing dot-segments. The segments “..” and “.” are usually removed from a URL according to the algorithm described in RFC 3986 (or a similar algorithm). Example:

http://www.example.com/../a/b/../c/./d.html → http://www.example.com/a/c/d.html
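The following is a sketch that follows the dot-segment removal procedure of RFC 3986 (section 5.2.4) step by step, applied to the path component only. The wrapper `normalize_path` is a hypothetical helper added so the function can be tried on a full URL.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a path, RFC 3986 style."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]  # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]  # "/../x" becomes "/x"
            if output:
                output.pop()  # and the previous segment is dropped
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)


def normalize_path(url: str) -> str:
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


print(normalize_path("http://www.example.com/../a/b/../c/./d.html"))
```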

These normalizations can be applied to URLs without changing their semantics.

Semantics-changing normalizations

Applying the following normalizations results in a semantically different URL, although it may refer to the same resource:

Removing directory index. Default directory indexes are generally not needed in URLs. Examples:

http://www.example.com/default.asp → http://www.example.com/

http://www.example.com/a/index.html → http://www.example.com/a/

Removing the fragment. The fragment component of a URL is usually removed. Example:

http://www.example.com/bar.html#section1 → http://www.example.com/bar.html
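In Python, fragment removal can be done with the standard library's `urllib.parse.urldefrag`. The wrapper name `strip_fragment` is hypothetical.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its fragment component."""
    return urldefrag(url).url


print(strip_fragment("http://www.example.com/bar.html#section1"))
```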

Replacing an IP address with a domain name. Check whether the IP address corresponds to a domain name and, if so, use the domain name. Example:

http://208.77.188.166/ → http://www.example.com/

Limiting protocols. Different application-layer protocols can be reduced to a single one. For example, the “https” scheme could be replaced with “http”. Example:

https://www.example.com/ → http://www.example.com/

Removing duplicate slashes. Paths which include two adjacent slashes should be converted to one. Example:

http://www.example.com/foo//bar.html → http://www.example.com/foo/bar.html
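A minimal sketch of slash collapsing follows. It operates on the path component only, so the "//" after the scheme is not affected; the function name `collapse_slashes` is an assumption.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Replace runs of adjacent slashes in the path with a single slash."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=path))


print(collapse_slashes("http://www.example.com/foo//bar.html"))
```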

Removing “www” as the first domain label. Some websites operate in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first. For example, http://example.com/ and http://www.example.com/ may access the same website. Although many websites redirect the user to the non-www address (or vice versa), some do not. A normalizer may perform extra processing to determine if there is a non-www equivalent and then normalize all URLs to the non-www prefix. Example:

http://www.example.com/ → http://example.com/

Sorting the variables of active pages. Some active web pages have more than one variable in the URL. A normalizer can extract all the variables with their values, sort them into alphabetical order (by variable name), and reassemble the URL. Example:

http://www.example.com/display?lang=en&article=fred → http://www.example.com/display?article=fred&lang=en
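The sorting step might look like the sketch below, which relies on `parse_qsl` and `urlencode` from the standard library. Note that this round trip decodes and re-encodes the values, so it can also change how special characters are escaped; the function name `sort_query` is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url: str) -> str:
    """Reassemble the URL with query variables sorted by name."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # sorted() is stable, so repeated names keep their relative order.
    pairs = sorted(pairs, key=lambda pair: pair[0])
    return urlunsplit(parts._replace(query=urlencode(pairs)))


print(sort_query("http://www.example.com/display?lang=en&article=fred"))
```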

Removing arbitrary querystring variables. An active page may expect certain variables to appear in the querystring; all unexpected variables should be removed. Example:

http://www.example.com/display?id=123&fakefoo=fakebar → http://www.example.com/display?id=123

Removing default querystring variables. A default value in the querystring will render identically whether it is there or not. When a default value appears in the querystring, it can be removed. Example:

http://www.example.com/display?id=&sort=ascending → http://www.example.com/display
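Both removal rules for querystring variables depend on knowledge about the specific page, so the sketch below takes that knowledge as parameters. The function name `prune_query` and the parameters `allowed` and `defaults` are hypothetical; how a normalizer learns these values is not covered here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def prune_query(url: str, allowed: set, defaults: dict = None) -> str:
    """Drop unexpected variables and variables that carry their default value."""
    defaults = defaults or {}
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in allowed and defaults.get(name) != value
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Unexpected variable removed:
print(prune_query("http://www.example.com/display?id=123&fakefoo=fakebar", {"id"}))
# Default values removed (assumed defaults: empty id, ascending sort):
print(
    prune_query(
        "http://www.example.com/display?id=&sort=ascending",
        {"id", "sort"},
        {"id": "", "sort": "ascending"},
    )
)
```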

Removing the "?" when the querystring is empty. When the querystring is empty, there is no need for the "?". Example:

http://www.example.com/display? → http://www.example.com/display
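A small string-based sketch of this rule is given below. It removes the "?" only when it is the first and last "?" before any fragment, which is taken here to mean that the querystring is empty; the name `strip_empty_query` is an assumption.

```python
def strip_empty_query(url: str) -> str:
    """Remove a "?" that introduces an empty querystring."""
    head, hash_sign, fragment = url.partition("#")
    # The query starts at the first "?"; it is empty only if that "?" ends the head.
    if head.endswith("?") and head.index("?") == len(head) - 1:
        head = head[:-1]
    return head + hash_sign + fragment


print(strip_empty_query("http://www.example.com/display?"))
```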

Standardizing character encoding. When the URL contains special characters such as a slash, dot, or space, check whether the encoded forms (such as "%2F") and the unencoded forms (such as "/") refer to the same resource, and if so use one form consistently. Example:

http://www.example.com/display?category=foo/bar+baz → http://www.example.com/display?category=foo%2Fbar%20baz

Normalization based on URL lists

Some normalization rules may be developed for specific websites by examining URL lists obtained from previous crawls or web server logs. For example, if the URL

http://foo.org/story?id=xyz

appears in a crawl log several times along with

http://foo.org/story_xyz

we may assume that the two URLs are equivalent and can be normalized to one of the URL forms.
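Once such an equivalence has been assumed, it could be recorded as a site-specific rewrite rule. The sketch below encodes the foo.org example as a regular-expression rule; the `RULES` table and `apply_rules` function are hypothetical, and this is not the DustBuster algorithm itself, only an illustration of applying an already-discovered rule.

```python
import re

# Each rule pairs a pattern with the canonical form it should be rewritten to.
RULES = [
    (re.compile(r"^http://foo\.org/story\?id=([^&#]+)$"), r"http://foo.org/story_\1"),
]


def apply_rules(url: str) -> str:
    """Rewrite the URL with the first matching site-specific rule, if any."""
    for pattern, replacement in RULES:
        rewritten, count = pattern.subn(replacement, url)
        if count:
            return rewritten
    return url


print(apply_rules("http://foo.org/story?id=xyz"))
```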

Schonfeld et al. (2006) present a heuristic called DustBuster for detecting DUST (different URLs with similar text) rules that can be applied to URL lists. They showed that once the correct DUST rules were found and applied with a canonicalization algorithm, they were able to find up to 68% of the redundant URLs in a URL list.

References

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
Sang Ho Lee, Sung Jin Kim, and Seok Hoo Hong (2005). "On URL normalization" (PDF). Proceedings of the International Conference on Computational Science and its Applications (ICCSA 2005). pp. 1076–1085.
Gautam Pant, Padmini Srinivasan, and Filippo Menczer (2004). "Crawling the Web" (PDF). Web Dynamics: Adapting to Change in Content, Size, Topology and Use, edited by M. Levene and A. Poulovassilis. pp. 153–178.
Uri Schonfeld, Ziv Bar-Yossef, and Idit Keidar (2006). "Do not crawl in the dust: different URLs with similar text". Proceedings of the 15th international conference on World Wide Web. pp. 1015–1016.
Uri Schonfeld, Ziv Bar-Yossef, and Idit Keidar (2007). "Do not crawl in the dust: different URLs with similar text". Proceedings of the 16th international conference on World Wide Web. pp. 111–120.
