HTTP and you
Posted by Jason Baker on May 10, 2009
I was kind of surprised by the number of people who told me that they weren’t aware of the differences between HTTP POST and HTTP GET that my last post highlighted. Not everyone who does web design and/or development has had a formal education on this kind of thing, so I’d like to focus a little bit more on the basics of HTTP. A full summary of the HTTP protocol would take a couple hundred pages (or 175 to be exact).
In a lot of ways, doing web development and/or design without knowing how this stuff works is a bit like doing Calculus without knowing how addition and subtraction work. True, you probably won’t ever need it. But you would be surprised at how many questions can be answered by having a basic understanding of HTTP.
Anatomy of a URI
As this helpful diagram of the URI shows, there are 5 basic parts:
- scheme – This is the protocol that we’re using to access whatever this URI represents. For obvious reasons, we’re only interested in http schemes.
- username/password – This isn’t really used much in the context of HTTP, but it should be pretty self-explanatory.
- hostname – This essentially tells us what computer we’re accessing. This can be either an IP address (ex: 18.104.22.168 if you’re using IPv4) or a domain name (google.com).
- port – This is the port number on the server we’re pulling data from. In the context of HTTP this will usually be port 80, but occasionally it will be something different. Also bear in mind that this may be different depending on the scheme (for example, FTP will be port 21 by default).
- path – This represents where the website “lives” on the server. It was largely designed for representing files and directories on a file system, but it’s worth mentioning that this part is ultimately little more than arbitrary text that may be interpreted by the server however it wishes.
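To see those five parts pulled out of a real string, here is a minimal sketch using Python's standard urllib.parse module. The URI itself (user name, password, host and port) is made up purely for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URI, chosen only so that every part described above is present.
uri = "http://jason:secret@example.com:8080/2009/05/10"
parts = urlsplit(uri)

print(parts.scheme)    # http
print(parts.username)  # jason
print(parts.password)  # secret
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /2009/05/10

# When the port is left out, urlsplit reports None and the client
# falls back to the scheme's default (80 for http).
no_port = urlsplit("http://google.com/")
print(no_port.port)    # None
```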
Anatomy of an HTTP request
When you access my blog via HTTP, your browser sends an HTTP request that looks something like this:
GET / HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF
Your browser will receive a response that looks something like this (bonus: there’s one header I left out. Can you guess which one it is? I hear there might be job offers if you can figure it out.):
HTTP/1.1 200 OK CRLF
Server: nginx CRLF
Date: Sun, 10 May 2009 23:16:28 GMT CRLF
Content-Type: text/html; charset=UTF-8 CRLF
Transfer-Encoding: chunked CRLF
Connection: close CRLF
Vary: Cookie CRLF
X-Pingback: https://jasonmbaker.wordpress.com/xmlrpc.php CRLF
CRLF
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...HTML goes here...
There are two important parts here: the request/response line and the headers. Just in case you’re wondering, the CRLF is a special kind of newline.
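As a rough sketch of what "a special kind of newline" means in practice, the Python below assembles a request by hand. CRLF is the two-character sequence carriage return plus line feed ("\r\n"). The build_request helper is a hypothetical name invented for this example, not part of any library, and the empty line it appends is what tells the server that the headers are finished.

```python
CRLF = "\r\n"  # carriage return followed by line feed

def build_request(method, path, host, extra_headers=None):
    """Assemble a bare-bones HTTP/1.1 request as bytes (illustrative helper)."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    # Every line ends in CRLF, and one extra empty line closes the headers.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")

request = build_request(
    "GET", "/", "jasonmbaker.wordpress.com", {"Connection": "close"}
)
print(request)
```

Printing the bytes makes the otherwise invisible "\r\n" sequences show up explicitly.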
The Request line
The request line will usually be in this general form:
<method> <path> HTTP/<version> CRLF
There are three parts to be concerned with:
- method – the HTTP method we’re using. A full discussion of all of these methods would be rather lengthy. The vast majority of webpages are requested using HTTP GET or POST. I have a full discussion of the differences between these two methods here.
- path – this is the path to the page we’re requesting. Usually, this is only the path part of the URI and nothing more. There’s a simple reason for this. By the time your web browser has connected to my blog, the server presumably already knows that it’s at jasonmbaker.wordpress.com. Since this isn’t always the case though, the hostname is passed either in the Host header or sometimes in the path itself, depending on circumstances.
- version – the version of HTTP we’re using. Usually this will be HTTP 1.0 or 1.1, but you will sometimes run into antiquated HTTP 0.9 clients and servers.
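The three parts are separated by single spaces, so pulling a request line apart takes only a few lines. This is a minimal sketch; parse_request_line is a made-up helper name, and it only handles the simple, well-formed case.

```python
def parse_request_line(line):
    """Split '<method> <path> HTTP/<version>' into its three parts."""
    method, path, version = line.rstrip("\r\n").split(" ")
    if not version.startswith("HTTP/"):
        raise ValueError("not an HTTP request line: %r" % line)
    # Drop the 'HTTP/' prefix so only the version number is returned.
    return method, path, version[len("HTTP/"):]

print(parse_request_line("GET / HTTP/1.1\r\n"))
```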
The Response line
The response line will look like this:
HTTP/<version> <response code> CRLF
Here’s how that breaks down:
- version – The version of HTTP. See above.
- response code – This indicates whether the server successfully found the requested page, if there was an error, or if the client needs to be redirected. If it found the page, it will return 200 OK. Otherwise, it will return some other code like the infamous 404 Not Found or a 302 Found if there is a redirect to be done.
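The response line can be taken apart the same way. In the sketch below, parse_status_line and describe are hypothetical helper names. The grouping by first digit reflects how the codes are organised: 2xx for success, 3xx for redirects, 4xx for client errors and 5xx for server errors.

```python
def parse_status_line(line):
    """Split 'HTTP/<version> <code> <reason>' into (version, code, reason)."""
    # Split at most twice so a reason phrase like 'Not Found' stays whole.
    parts = line.rstrip("\r\n").split(" ", 2)
    version = parts[0][len("HTTP/"):]
    code = int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason

def describe(code):
    """Classify a response code by its first digit."""
    kinds = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}
    return kinds.get(code // 100, "other")

version, code, reason = parse_status_line("HTTP/1.1 404 Not Found\r\n")
print(version, code, reason, describe(code))
```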
An HTTP header will usually be of this form:
<header name>: <header value> CRLF
Headers are basically just “metadata” about the request. They include information about the encoding of the data, the browser requesting the page, and the server returning the page. HTTP was designed to be extensible, so you will frequently run into headers that aren’t specified in the original RFC.
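Here is a simplified sketch of turning a block of header lines into a dictionary. parse_headers is a name invented for this example. It lowercases the header names so lookups work however the sender capitalised them, and it deliberately ignores complications such as repeated headers.

```python
def parse_headers(block):
    """Turn 'Name: value' lines separated by CRLF into a dict (simplified)."""
    headers = {}
    for line in block.split("\r\n"):
        if not line:
            continue  # skip the empty line that ends the header block
        # Split on the first colon only, since values may contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers

raw_headers = (
    "Server: nginx\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "X-Pingback: https://jasonmbaker.wordpress.com/xmlrpc.php\r\n"
)
headers = parse_headers(raw_headers)
print(headers["content-type"])
```

Note that X-Pingback is one of those extension headers: the parser doesn't need to know about it in advance.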
Sometimes webpages will require additional data to return a webpage. There are two ways to do this: in the query string and in the body of the request.
The query string
In the case of HTTP GET and a couple of other HTTP methods, this data will be passed through the query string. This request will look something like this:
GET /?page=123 HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF
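The query string is everything after the question mark, written as name=value pairs joined by ampersands. A small sketch with Python's standard urllib.parse shows both directions. The second parameter, q, is invented here just to show how spaces get encoded.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

target = "/?page=123"

# Everything after the '?' is the query string.
query = urlsplit(target).query
print(query)            # page=123

# parse_qs returns a list per name because a name may appear more than once.
params = parse_qs(query)
print(params)           # {'page': ['123']}

# Going the other way: build a query string from a dict.
encoded = urlencode({"page": 123, "q": "http and you"})
print(encoded)          # page=123&q=http+and+you
```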
HTTP POST requests and all responses will pass data through the body. An HTTP POST request will look something like this:
POST / HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF
CRLF
page=123
Notice that there are two CRLFs between the HTTP headers and the body.
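That double CRLF is how a program finds where the headers stop and the body starts. Below is a minimal sketch using a shortened version of the POST request (the User-Agent header is left out for brevity, and real requests normally also carry a Content-Length header).

```python
raw = (
    b"POST / HTTP/1.1\r\n"
    b"Host: jasonmbaker.wordpress.com\r\n"
    b"Connection: close\r\n"
    b"\r\n"          # the empty line: this is the second CRLF in a row
    b"page=123"
)

# Split once at the first blank line; everything after it is the body.
head, separator, body = raw.partition(b"\r\n\r\n")

request_line = head.split(b"\r\n")[0]
print(request_line)  # b'POST / HTTP/1.1'
print(body)          # b'page=123'
```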
Here are some of the things that will cause problems if you deal with HTTP often enough:
- HTTP is selectively case sensitive. HTTP header names are not case sensitive, which means that a server has to be prepared to treat CONTENT-ENCODING, content-encoding, and cOnTeNt-EnCoDiNg exactly the same. The method name, on the other hand, is case sensitive: GET is not the same as get.
- Slashes on the end DO matter. For example, http://www.google.com/index.html and http://www.google.com/index.html/ are different URIs. Unless you’re trying to be tricky, you usually want to make these point to the same thing.
- The www matters. For example, http://www.google.com and http://google.com are not only different URIs, they might even point to different servers. Usually, people expect these to be the same.
- Path handling is harder than it looks. For example, what happens if I want to join “/2009/05” and “10” to make “/2009/05/10”? I can’t just concatenate those two strings together because then I would get “/2009/0510”. Nor can I arbitrarily append slashes because then I could end up with something like “/2009/05//10” if I’m not careful.
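One way to sidestep the path-joining problem is to strip any slashes at the seam and then add exactly one back. This is just a sketch: join_path is a hypothetical helper, and it makes no attempt to handle things like “..” segments or query strings.

```python
def join_path(base, segment):
    """Join two path pieces with exactly one slash between them."""
    return base.rstrip("/") + "/" + segment.lstrip("/")

print(join_path("/2009/05", "10"))    # /2009/05/10
print(join_path("/2009/05/", "10"))   # /2009/05/10
print(join_path("/2009/05/", "/10"))  # /2009/05/10
```

All three spellings of the inputs come out the same, which is usually what you want.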
So, you probably know more about HTTP than you ever wanted to know. For what it’s worth, HTTP is a bit of an antiquated protocol with a lot of “historical” features. But it does the job it was intended to do and it does it well.
If you find any inaccuracies, please post them in the comments. But bear in mind that I intended this to be for a broad audience, so there might be a few points that I oversimplified for the sake of brevity. If you want to fill in the holes, there’s really no better place to look than the HTTP specifications (RFC 1945 for HTTP 1.0 and RFC 2616 for HTTP 1.1). If you’re new to HTTP, I’d highly recommend looking at the HTTP 1.0 specification first as it’s about a third as complex as HTTP 1.1.