Jason in a Nutshell

All about programming and whatever else comes to mind

HTTP and you

Posted by Jason Baker on May 10, 2009

I was kind of surprised by the number of people told me that they weren’t aware of the differences between HTTP POST and HTTP GET that my last post highlighted.  Not everyone who does web design and/or development has had a formal education on this kind of thing, so I’d like to focus a little bit more on the basics of HTTP.  A full summary of the HTTP protocol would take a couple hundred pages (or 175 to be exact).

In a lot of ways, doing web development and/or design without knowing how this stuff works is a bit like doing Calculus without knowing how addition and subtraction work.  True, you probably won’t ever need it.  But you would be surprised at how many questions can be answered by having a basic understanding of HTTP.

Anatomy of a URI

 

As this helpful diagram of the URI shows, there are 5 basic parts:

  • scheme – This is the protocol that we’re using to access whatever this UR – I represents.  For obvious reasons, we’re only interested in http schemes.
  • username/password – This isn’t really used much in the context of HTTP, but it should be pretty self explanatory.
  • hostname – This essentially tells us what computer we’re accessing.  This can be either an IP address (ex: 209.85.171.100 if you’re using IPv4) or a domain name (google.com).
  • port – This is the port number on the server we’re pulling data from.  In the context of HTTP this will usually be port 80, but occasionally it will be something different.  Also bear in mind that this may be different depending on the scheme (for example, FTP will be port 21 by default).
  • path – This represents where the website “lives” on the server.  It was largely designed for representing files and directories on a file system, but it’s worth mentioning that this part is ultimately little more than arbitrary text that may be interpreted by the server however it wishes.

Anatomy of an HTTP request

When you access my blog via HTTP, your browser sends an HTTP request that looks something like this:

GET / HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF

Your browser will receive a response that looks something like this (bonus:  there’s one header I left out.  Can you guess which one it is?  I hear there might be job offers if you can figure it out.):

HTTP/1.1 200 OK CRLF 
Server: nginx CRLF
Date: Sun, 10 May 2009 23:16:28 GMT CRLF
Content-Type: text/html; charset=UTF-8 CRLF
Transfer-Encoding: chunked CRLF
Connection: close CRLF
Vary: Cookie CRLF
X-Pingback: https://jasonmbaker.wordpress.com/xmlrpc.php CRLF
CRLF
<!DOCTYPE·html·PUBLIC·"-//W3C//DTD·XHTML·1.0·Transitional//EN"·"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...HTML goes here...

There are two important parts here:  the request/response line and the headers.  Just in case you’re wondering, the CRLF is a special kind of newline.  

The Request line

The request line will usually be in this general form:

<method> <path> HTTP/<version> CRLF

There are three parts to be concerned with :

  • method – the HTTP method we’re using.  A full discussion of all of these methods would be rather lengthy.  The vast majority of webpages are requested using HTTP GET or POST.  I have a full discussion of the differences between these two methods here.
  • path – this is the path to the page we’re requesting. Usually, this is only the path part of the URI and nothing more.  There’s a simple reason for this.  By the time your web browser has connected to my blog, the server presumably already knows that it’s at jasonmbaker.wordpress.com.  Since this isn’t always the case though, this is passed either in the Host header or sometimes in the path depending on circumstances.
  • version – the version of HTTP we’re using.  Usually this will be HTTP 1.0 or 1.1, but you will sometimes run into antiquated HTTP 0.9 clients and servers.

The Response line

The response line will look like this:

HTTP/<version> <response code> CRLF

Here’s how that breaks down:

  • version – The version of HTTP.  See above.
  • response code – This indicates whether the server successfully found the requested page, if there was an error, or if the client needs to be redirected.  If it found the page, it will return 200 OK.  Otherwise, it will return some other code like the infamous 404 Not Found or a 302 Found if there is a redirect to be done.

 

HTTP Headers

An HTTP header will usually be of this form:

<header name>:  <header value> CRLF

Headers are basically just “metadata” about the request.  They include information about the encoding of the data, the browser requesting the page, and the server returning the page.  HTTP was designed to be extensible, so you will frequently run into headers that aren’t specified in the original RFC.

Form Data

Sometimes webpages will require additional data to return a webpage.  There are two ways to do this:  in the query string and in the body of the request.

The query string

In the case of HTTP GET and a couple of other HTTP methods, this data will be passed through the query string.  This request will look something like this:

GET /?page=123 HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF

The body

HTTP POST requests and all responses will pass data through the body.  An HTTP POST request will look something like this:

POST / HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF
CRLF
page=123

Notice that there are two CRLFs between the HTTP headers and the body.

Gotchas

Here are some of the things that will cause problems if you deal with HTTP often enough:

  1. HTTP is selectively case sensitive.  Essentially, HTTP header names are not case sensitive.  This means that a server has to be prepared to treat CONTENT-ENCODING, content-encoding, and cOnTeNt-EnCoDiNg exactly the same.
  2. Slashes on the end DO matter.  For example, http://www.google.com/index.html and http://www.google.com/index.html/ are different URIs.  Unless you’re trying to be tricky, you usually want to make these point to the same thing.
  3. The www matters.  For example, http://www.google.com and http://google.com are not only different URIs, they might even point to different servers.  Usually, people expect these to be the same.
  4. Path handling is harder than it looks.  For example, what happens if I want to join “/2009/05” and “10” to make “/2009/05/10”?  I can’t just concatenate those two strings together because then I would get “2009/0510.”  Nor can I arbitrarily append slashes because then I could end up with something like “/2009/05//10” if I’m not careful.

Conclusion

So, you probably know more about HTTP than you ever wanted to know.  For what it’s worth, HTTP is a bit of an antiquated protocol with a lot of “historical” features.  But it does the job it was intended to do and it does it well.

If you find any inaccuracies, please post them in the comments.  But bear in mind that I intended this to be for a broad audience, so there might be a few points that I oversimplified for the sake of simplicity.  If you want to fill in the holes, there’s not really any other place to look than the HTTP specifications (RFC 1945 for HTTP 1.0 and  RFC 2616 for HTTP 1.1).  If you’re new to HTTP, I’d highly recommend looking at the HTTP 1.0 specification first as it’s about a third as complex as HTTP 1.1.

3 Responses to “HTTP and you”

  1. Joe Schmoe said

    You didn’t leave out a header, in fact, for HTTP 1.1 only the host header is required.

  2. Screwtape said

    In your HTTP POST example, the HTTP verb should probably be “POST”, not “GET”. 🙂

    Also, when talking about the port-number section of the URL, you say “Usually this will be port 80″… but the URL in the diagram is an FTP url. You might want to say that if not supplied, the default is usually determined by the scheme (there’s some schemes that don’t use TCP or UDP, and some URIs aren’t supposed to be resolved at all…)

    Also also, doesn’t an HTTP POST request require a Content-Type header? Not every POST is an HTML form, and things like file-uploads get downright hairy…

  3. Jason Baker said

    @Screwtape – Good points and thank you for catching the GET/POST nit.

    With regards to ports, I was mainly thinking about this with HTTP in mind and didn’t think to mention that this was dependent on the scheme. I’ll update that.

    Lastly, I don’t believe HTTP POST requires a Content-Type header nor can I find that in RFC 2616. Would you be able to point me in the right direction on that?

    UPDATE: Figured out what you mean on this one. HTTP 1.0 requires a Content-Length header for POSTs while HTTP 1.1 doesn’t. 🙂

Leave a comment