Difference between revisions of "url-formats"

From Microformats Wiki
url-formats
(add Googler terms for parts of a URL)
m (Replace <entry-title> with {{DISPLAYTITLE:}})
 
(9 intermediate revisions by one other user not shown)
Line 1: Line 1:
<entry-title>URL formats</entry-title>
+
{{DISPLAYTITLE:URL formats}}
  
 
URLs are often defined and represented in various systems as a set of various pieces/parts. This page documents the implicit formats from those systems.
 
URLs are often defined and represented in various systems as a set of various pieces/parts. This page documents the implicit formats from those systems.
 +
 +
== Why ==
 +
While similar names are used for the various parts of a URL, it's quite surprising how much variety there is for this fundamental building block of the web.
 +
 +
Why do each of these descriptions of a URL use somewhat different names (and in many cases punctuation boundaries) than the others, and how did this happen?
 +
 +
Perhaps by placing them in a historical order we can make some sense of the evolution of the terminology, which has likely also diverged when adopted by different communities.
  
 
== URL specification ==
 
== URL specification ==
Line 32: Line 39:
 
*** '''scheme'''
 
*** '''scheme'''
 
*** ''':'''
 
*** ''':'''
*** '''relativeURI'''
+
*** '''relativeURI''' (typically also '''net_path''')
**** '''net_path'''
+
**** '''//'''
***** '''//'''
+
**** '''net_loc'''
***** '''net_loc'''
+
**** '''abs_path'''
***** '''abs_path'''
+
***** '''/'''
****** '''/'''
+
***** '''rel_path'''
****** '''rel_path'''
+
****** '''path'''
******* '''path'''
+
******* '''fsegment'''
******** '''fsegment'''
+
******* '''segment''' (zero or more, if present, preceded by '''/''')
******** '''segment''' (zero or more, if present, preceded by '''/''')
+
****** '''params''' (if present, preceded by ''';''')
******* '''params''' (if present, preceded by ''';''')
+
****** '''query''' (if present, preceded by '''?''')
******* '''query''' (if present, preceded by '''?''')
 
 
** '''fragment''' (if present, preceded by '''#''')
 
** '''fragment''' (if present, preceded by '''#''')
  
Line 57: Line 63:
 
* :port is omitted if the port is 80
 
* :port is omitted if the port is 80
 
* empty abs_path is replaced with '''/'''
 
* empty abs_path is replaced with '''/'''
 +
 +
== DOM ==
 +
1996 https://developer.mozilla.org/en/DOM/window.location#Properties
 +
 +
The window.location object represent the URL of the window's page and thus also has properties (terms) for the different parts/pieces.
 +
 +
Properties:
 +
 +
* '''protocol''' - e.g. "http:"
 +
* '''host''' - e.g. "www.example.com:80"
 +
** '''hostname''' - e.g. "www.example.com"
 +
** '''port''' - e.g. "80"
 +
* '''pathname''' - e.g. "/search"
 +
* '''search''' - e.g. "?q=devmo"
 +
* '''hash''' - e.g. "#test"
  
 
== CGI ==
 
== CGI ==
 +
~1997-1999? Common Gateway Interface, specifically, Environment Variables
 +
 
* http://tools.ietf.org/html/rfc3875
 
* http://tools.ietf.org/html/rfc3875
 +
* http://www.citycat.ru/doc/CGI/overview/env.html
 
* http://en.wikipedia.org/wiki/Common_Gateway_Interface - has example:  
 
* http://en.wikipedia.org/wiki/Common_Gateway_Interface - has example:  
  
Line 66: Line 90:
 
Terms:
 
Terms:
 
* '''script-URI'''
 
* '''script-URI'''
** '''scheme''' same as SERVER_PROTOCOL
+
** '''scheme'''
 
** '''://'''
 
** '''://'''
 
** '''server-name''' - SERVER_NAME
 
** '''server-name''' - SERVER_NAME
Line 75: Line 99:
 
** '''?'''
 
** '''?'''
 
** '''query-string''' - QUERY_STRING
 
** '''query-string''' - QUERY_STRING
 
  
 
Environment variables:
 
Environment variables:
 
* '''SERVER_PROTOCOL''' - not the protocol scheme, e.g. "HTTP/1.1"
 
* '''SERVER_PROTOCOL''' - not the protocol scheme, e.g. "HTTP/1.1"
* '''HTTP_HOST''' - e.g "example.com"
+
* '''SERVER_NAME''' or '''HTTP_HOST''' - e.g "example.com"
 
* '''SERVER_PORT''' - e.g. "80"
 
* '''SERVER_PORT''' - e.g. "80"
* '''REMOTE_USER'''
+
* '''REMOTE_USER''' - the username (but not password)
 
* '''PATH''' - not the URL path, but to the web server on the system
 
* '''PATH''' - not the URL path, but to the web server on the system
 
* '''REQUEST_URI''' - e.g. "/cgi-bin/printenv.pl/ponylove?q=20%C001er&moar=kitties"
 
* '''REQUEST_URI''' - e.g. "/cgi-bin/printenv.pl/ponylove?q=20%C001er&moar=kitties"
** '''SCRIPT_NAME''' - e.g. "/cgi-bin/printenv.pl"
+
** '''SCRIPT_NAME''' - e.g. "/cgi-bin/printenv.pl" (first two segments?)
** '''PATH_INFO''' - e.g. "/ponylove"
+
** '''PATH_INFO''' - e.g. "/ponylove" (remainder of path)
 
** '''QUERY_STRING''' - e.g. "q=20%C001er&moar=kitties"
 
** '''QUERY_STRING''' - e.g. "q=20%C001er&moar=kitties"
  
== DOM ==
+
== Python 2 ==
* https://developer.mozilla.org/en/DOM/window.location#Properties
+
2000[http://en.wikipedia.org/wiki/Python_%28programming_language%29#History] Python 2 urlparse
 +
* http://docs.python.org/library/urlparse.html
  
The window.location object represent the URL of the window's page and thus also has properties (terms) for the different parts/pieces.
+
Attributes
 +
* '''scheme''' - e.g. "http"
 +
* '''netloc''' - e.g. "www.cwi.nl:80"
 +
** '''username'''
 +
** '''password'''
 +
** '''hostname'''
 +
** '''port'''
 +
* '''path''' - e.g. "/%7Eguido/Python.html"
 +
* '''params''' (if present, preceded by ''';''')
 +
* '''query''' (if present, preceded by '''?''')
 +
* '''fragment''' (if present, preceded by '''#''')
  
Properties:
+
== URI specification ==
 +
2005 URI Generic Syntax
 +
* http://www.ietf.org/rfc/rfc3986.txt with example:
 +
<nowiki>foo://example.com:8042/over/there?name=ferret#nose</nowiki>
  
* '''protocol''' - e.g. "http:"
+
* '''scheme''' - e.g. "foo"
* '''host''' - e.g. "www.example.com:80"
+
* '''":"'''
** '''hostname''' - e.g. "www.example.com"
+
* '''hier-part''' - e.g.  
** '''port''' - e.g. "80"
+
** '''"//"'''
* '''pathname''' - e.g. "/search"
+
** '''authority''' - e.g. "example.com:8042"
* '''search''' - e.g. "?q=devmo"
+
** '''path''' - e.g. "/over/there"
* '''hash''' - e.g. "#test"
+
*** '''path-abempty''' or
 +
*** '''path-absolute''' or
 +
*** '''path-rootless''' or
 +
*** '''path-empty'''
 +
* '''query''' (if present, preceded by '''"?"''') e.g. "name=ferret"
 +
* '''fragment''' (if present, preceded by  "#") e.g. "nose"
  
 
== Googler ==
 
== Googler ==
Per Matt Cutts's blog post <cite>[http://www.mattcutts.com/blog/seo-glossary-url-definitions/ Talk like a Googler: parts of a url]</cite>: of for example:
+
2007 Per Matt Cutts's blog post <cite>[http://www.mattcutts.com/blog/seo-glossary-url-definitions/ Talk like a Googler: parts of a url]</cite>: of for example:
  
 
<nowiki>http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s</nowiki>
 
<nowiki>http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s</nowiki>

Latest revision as of 16:33, 18 July 2020


Different systems define and represent URLs as a set of named parts. This page documents the implicit formats those systems use.

Why

While similar names are used for the various parts of a URL, it's quite surprising how much variety there is for this fundamental building block of the web.

Why does each of these descriptions of a URL use somewhat different names (and in many cases different punctuation boundaries) than the others, and how did this happen?

Perhaps by placing them in a historical order we can make some sense of the evolution of the terminology, which has likely also diverged when adopted by different communities.

URL specification

The URL specification is perhaps the most canonical source for the names of the different parts of a URL.

1994 http://www.w3.org/Addressing/URL/url-spec.txt

Names are quoted literally, dropping any "The" prefix and "part" suffix.

  • PrePrefix - e.g. "URL:". The portion before the "http".
  • Scheme - e.g. "http"
  • :
  • Internet protocol parts
    • // (until the following /)
    • user name (if present; followed by the optional password field below, then an @)
    • password (if present, preceded by a :)
    • internet domain name - e.g. "www.w3.org"
    • port number (if present, preceded by a :)
  • Path
    • search
  • fragmentid - "the hash sign and following"

HTTP

The HTTP specification has a few notes about the format/portions of HTTP URLs.

1996 http://www.ietf.org/rfc/rfc1945.txt - 3.2.1 General Syntax

  • URI
    • absoluteURI
      • scheme
      • :
      • relativeURI (typically also net_path)
        • //
        • net_loc
        • abs_path
          • /
          • rel_path
            • path
              • fsegment
              • segment (zero or more, if present, preceded by /)
            • params (if present, preceded by ;)
            • query (if present, preceded by ?)
    • fragment (if present, preceded by #)

Also:

  • http_URL
    • http://
    • host
    • port (if present, preceded by :)
    • abs_path (as defined above)

Canonicalization:

  • host is lowercased
  • :port is omitted if the port is 80
  • empty abs_path is replaced with /
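
The three canonicalization rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name canonicalize_http_url is hypothetical, it relies on Python 3's urllib.parse rather than anything in RFC 1945, and it ignores user information and IPv6 literals.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize_http_url(url):
    """Apply the three listed rules to an http URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Rule 1: host is lowercased (urlsplit's .hostname is already lowercase).
    host = (parts.hostname or "").lower()
    # Rule 2: ":port" is omitted if the port is 80.
    port = parts.port
    netloc = host if port in (None, 80) else f"{host}:{port}"
    # Rule 3: an empty abs_path is replaced with "/".
    path = parts.path or "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


print(canonicalize_http_url("http://WWW.Example.COM:80"))
# prints "http://www.example.com/"
```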

DOM

1996 https://developer.mozilla.org/en/DOM/window.location#Properties

The window.location object represents the URL of the window's page and thus also has properties (terms) for the different parts/pieces.

Properties:

  • protocol - e.g. "http:"
  • host - e.g. "www.example.com:80"
    • hostname - e.g. "www.example.com"
    • port - e.g. "80"
  • pathname - e.g. "/search"
  • search - e.g. "?q=devmo"
  • hash - e.g. "#test"
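
To compare the DOM names with the other vocabularies on this page, the following Python sketch derives window.location-style values from a URL string. The function name location_parts is hypothetical, and the mapping is an approximation: it assumes a URL without user information and does not model the normalization a real browser performs (for instance, dropping a default port).

```python
from urllib.parse import urlsplit


def location_parts(url):
    """Return a dict keyed by the DOM property names listed above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme + ":",  # DOM includes the trailing colon
        "host": parts.netloc,            # hostname plus ":port" if given
        "hostname": parts.hostname or "",
        "port": "" if parts.port is None else str(parts.port),
        "pathname": parts.path,
        # DOM keeps the leading "?" and "#" when the part is non-empty.
        "search": "?" + parts.query if parts.query else "",
        "hash": "#" + parts.fragment if parts.fragment else "",
    }


for name, value in location_parts(
    "http://www.example.com:80/search?q=devmo#test"
).items():
    print(name, "=", repr(value))
```

Note how the DOM terms keep the delimiter inside the value ("http:", "?q=devmo", "#test"), whereas most of the other formats on this page treat the punctuation as a separator outside the part.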

CGI

~1997-1999? Common Gateway Interface, specifically, Environment Variables

http://example.com/cgi-bin/printenv.pl/ponylove?q=20%C001er&moar=kitties

Terms:

  • script-URI
    • scheme
    • ://
    • server-name - SERVER_NAME
    • :
    • server-port - SERVER_PORT
    • script-path same as SCRIPT_NAME
    • extra-path same as PATH_INFO
    • ?
    • query-string - QUERY_STRING

Environment variables:

  • SERVER_PROTOCOL - not the protocol scheme, e.g. "HTTP/1.1"
  • SERVER_NAME or HTTP_HOST - e.g. "example.com"
  • SERVER_PORT - e.g. "80"
  • REMOTE_USER - the username (but not password)
  • PATH - not the URL path, but the system's executable search path as seen by the web server process
  • REQUEST_URI - e.g. "/cgi-bin/printenv.pl/ponylove?q=20%C001er&moar=kitties"
    • SCRIPT_NAME - e.g. "/cgi-bin/printenv.pl" (the path that identifies the script; here, the first two segments)
    • PATH_INFO - e.g. "/ponylove" (remainder of path)
    • QUERY_STRING - e.g. "q=20%C001er&moar=kitties"
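
The relationship between the script-URI terms and the environment variables can be illustrated by reassembling the example URL from its pieces. This is a rough Python sketch, not a CGI implementation: the function name script_uri is hypothetical, the scheme is passed in as an assumption because none of the variables listed above carries it, and percent-encoding of PATH_INFO is ignored.

```python
def script_uri(env, scheme="http"):
    """Rebuild a script-URI from CGI-style variables (hypothetical helper)."""
    # scheme "://" server-name ":" server-port
    uri = f"{scheme}://{env['SERVER_NAME']}:{env['SERVER_PORT']}"
    # script-path followed by extra-path
    uri += env.get("SCRIPT_NAME", "") + env.get("PATH_INFO", "")
    # "?" query-string, only when a query is present
    query = env.get("QUERY_STRING", "")
    if query:
        uri += "?" + query
    return uri


example_env = {
    "SERVER_NAME": "example.com",
    "SERVER_PORT": "80",
    "SCRIPT_NAME": "/cgi-bin/printenv.pl",
    "PATH_INFO": "/ponylove",
    "QUERY_STRING": "q=20%C001er&moar=kitties",
}
print(script_uri(example_env))
# prints "http://example.com:80/cgi-bin/printenv.pl/ponylove?q=20%C001er&moar=kitties"
```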

Python 2

2000[1] Python 2 urlparse

Attributes

  • scheme - e.g. "http"
  • netloc - e.g. "www.cwi.nl:80"
    • username
    • password
    • hostname
    • port
  • path - e.g. "/%7Eguido/Python.html"
  • params (if present, preceded by ;)
  • query (if present, preceded by ?)
  • fragment (if present, preceded by #)
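
As a quick illustration of these attributes, here is a sketch using Python 3, where the urlparse module's functions live in urllib.parse. The helper name parse_example is hypothetical, and the user:secret, ;v=1, ?q=x and #top pieces of the sample URL are made up so that every attribute has a value.

```python
from urllib.parse import urlparse


def parse_example(url):
    """Collect the attributes listed above into a dict (hypothetical helper)."""
    r = urlparse(url)
    return {
        "scheme": r.scheme,
        "netloc": r.netloc,
        "username": r.username,
        "password": r.password,
        "hostname": r.hostname,
        "port": r.port,  # an int, or None when absent
        "path": r.path,
        "params": r.params,
        "query": r.query,
        "fragment": r.fragment,
    }


sample = "http://user:secret@www.cwi.nl:80/%7Eguido/Python.html;v=1?q=x#top"
for name, value in parse_example(sample).items():
    print(name, "=", repr(value))
```

Unlike the DOM properties, the delimiters (";", "?", "#") are not part of the returned values, and netloc still contains the user information.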

URI specification

2005 URI Generic Syntax

foo://example.com:8042/over/there?name=ferret#nose

  • scheme - e.g. "foo"
  • ":"
  • hier-part - e.g. "//example.com:8042/over/there"
    • "//"
    • authority - e.g. "example.com:8042"
    • path - e.g. "/over/there"
      • path-abempty or
      • path-absolute or
      • path-rootless or
      • path-empty
  • query (if present, preceded by "?") e.g. "name=ferret"
  • fragment (if present, preceded by "#") e.g. "nose"
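
RFC 3986 (Appendix B) gives a regular expression for breaking a URI reference into these components. The Python sketch below applies that expression to the example above; the function name split_uri and the dict layout are assumptions made for illustration, and no validation of the individual components is attempted.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split a URI reference into its five generic components."""
    m = URI_PATTERN.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_uri("foo://example.com:8042/over/there?name=ferret#nose"))
```

Components that are absent come back as None, which distinguishes a missing query from an empty one ("?" with nothing after it).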

Googler

2007 Per Matt Cutts's blog post Talk like a Googler: parts of a url, for example:

http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s

Parts of a url:

  • protocol - e.g. "http"
  • host or hostname - e.g. "video.google.co.uk"
    • subdomain - e.g. "video"
    • domain name - e.g. "google.co.uk"
    • top-level domain or TLD - e.g. "uk" (which in this case is also referred to as a country-code top-level domain or ccTLD)
  • port - e.g. "80"
  • path - e.g. "/videoplay"
  • parameters - e.g. "?docid=-7246927612831078230&hl=en"
    • parameter - e.g. "docid" with value "-7246927612831078230"
  • fragment or named anchor - e.g. "#00h02m30s"
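
The breakdown above could be approximated in Python as follows. This is a naive sketch: the function name googler_parts is hypothetical, and the registrable suffix ("co.uk" here) must be supplied by the caller, since correctly separating the subdomain from the domain name in general requires a list of public suffixes that is not modeled here.

```python
from urllib.parse import urlsplit, parse_qsl


def googler_parts(url, public_suffix):
    """Split a URL using the terms listed above (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    suffix = "." + public_suffix
    if not host.endswith(suffix):
        raise ValueError("host does not end with the given suffix")
    # Labels to the left of the suffix: the last one belongs to the domain name.
    labels = host[: -len(suffix)].split(".")
    return {
        "protocol": parts.scheme,
        "host": host,
        "subdomain": ".".join(labels[:-1]),
        "domain name": labels[-1] + suffix,
        "TLD": host.rsplit(".", 1)[-1],
        "port": "" if parts.port is None else str(parts.port),
        "path": parts.path,
        "parameters": dict(parse_qsl(parts.query)),
        "fragment": parts.fragment,
    }


url = "http://video.google.co.uk:80/videoplay?docid=-7246927612831078230&hl=en#00h02m30s"
for name, value in googler_parts(url, "co.uk").items():
    print(name, "=", repr(value))
```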

related