Jason in a Nutshell

All about programming and whatever else comes to mind

Posts Tagged ‘Networking’

HTTP and you

Posted by Jason Baker on May 10, 2009

I was kind of surprised by the number of people who told me that they weren’t aware of the differences between HTTP POST and HTTP GET that my last post highlighted.  Not everyone who does web design and/or development has had a formal education on this kind of thing, so I’d like to focus a little bit more on the basics of HTTP.  A full summary of the HTTP protocol would take a couple hundred pages (or 175 to be exact).

In a lot of ways, doing web development and/or design without knowing how this stuff works is a bit like doing Calculus without knowing how addition and subtraction work.  True, you probably won’t ever need it.  But you would be surprised at how many questions can be answered by having a basic understanding of HTTP.

Anatomy of a URI

 

As this helpful diagram of the URI shows, there are 5 basic parts:

  • scheme – This is the protocol that we’re using to access whatever this URI represents.  For obvious reasons, we’re only interested in http schemes.
  • username/password – This isn’t really used much in the context of HTTP, but it should be pretty self-explanatory.
  • hostname – This essentially tells us what computer we’re accessing.  This can be either an IP address (ex: 209.85.171.100 if you’re using IPv4) or a domain name (google.com).
  • port – This is the port number on the server we’re pulling data from.  In the context of HTTP this will usually be port 80, but occasionally it will be something different.  Also bear in mind that this may be different depending on the scheme (for example, FTP will be port 21 by default).
  • path – This represents where the website “lives” on the server.  It was largely designed for representing files and directories on a file system, but it’s worth mentioning that this part is ultimately little more than arbitrary text that may be interpreted by the server however it wishes.
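To make those five parts concrete, here is a small sketch using Python's standard urllib.parse module. The URI itself is made up for illustration (the username, password, and port are assumptions, not anything my blog actually uses).

```python
from urllib.parse import urlsplit

# Hypothetical URI chosen so that every part described above is present.
uri = "http://jason:secret@example.com:8080/2009/05/10/"
parts = urlsplit(uri)

print(parts.scheme)    # http
print(parts.username)  # jason
print(parts.password)  # secret
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /2009/05/10/
```

If the port is left out of the URI, parts.port is None and the client falls back to the default for the scheme.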

Anatomy of an HTTP request

When you access my blog via HTTP, your browser sends an HTTP request that looks something like this:

GET / HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF
CRLF

Your browser will receive a response that looks something like this (bonus:  there’s one header I left out.  Can you guess which one it is?  I hear there might be job offers if you can figure it out.):

HTTP/1.1 200 OK CRLF 
Server: nginx CRLF
Date: Sun, 10 May 2009 23:16:28 GMT CRLF
Content-Type: text/html; charset=UTF-8 CRLF
Transfer-Encoding: chunked CRLF
Connection: close CRLF
Vary: Cookie CRLF
X-Pingback: https://jasonmbaker.wordpress.com/xmlrpc.php CRLF
CRLF
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...HTML goes here...

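There are two important parts here:  the request/response line and the headers.  Just in case you’re wondering, the CRLF is a special kind of newline:  a carriage return followed by a line feed (“\r\n” in most programming languages).  An empty line, meaning a CRLF all by itself, marks the end of the headers.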
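Here is a rough sketch of how a request like the one above could be assembled by hand in Python. The build_request helper is my own invention for illustration; real HTTP libraries do a lot more than this.

```python
CRLF = "\r\n"  # carriage return + line feed

def build_request(method, path, headers):
    """Assemble a raw HTTP/1.1 request from a method, a path, and (name, value) pairs."""
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers]
    # Every line ends in CRLF, and one extra empty line marks the end of the headers.
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")

request = build_request("GET", "/", [
    ("Host", "jasonmbaker.wordpress.com"),
    ("Connection", "close"),
])
print(request)
```

Printing the bytes makes the otherwise invisible \r\n sequences show up, including the empty line at the end.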

The Request line

The request line will usually be in this general form:

<method> <path> HTTP/<version> CRLF

There are three parts to be concerned with:

  • method – the HTTP method we’re using.  A full discussion of all of these methods would be rather lengthy.  The vast majority of webpages are requested using HTTP GET or POST.  I have a full discussion of the differences between these two methods here.
  • path – this is the path to the page we’re requesting. Usually, this is only the path part of the URI and nothing more.  There’s a simple reason for this.  By the time your web browser has connected to my blog, it has already used the hostname to find the server.  A single server can host many sites though, so the hostname still has to be passed along.  It goes in the Host header (which HTTP 1.1 requires) or, when the request goes through a proxy, in the path itself as a full URI.
  • version – the version of HTTP we’re using.  Usually this will be HTTP 1.0 or 1.1, but you will sometimes run into antiquated HTTP 0.9 clients and servers.
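Going the other direction, a server has to pull those three parts back out of the request line. A minimal sketch, assuming a well-formed line (parse_request_line is a hypothetical helper, not part of any library):

```python
def parse_request_line(line):
    """Split a request line into (method, path, version)."""
    # Raises ValueError if the line does not have exactly three space-separated parts.
    method, path, protocol = line.rstrip("\r\n").split(" ")
    name, _, version = protocol.partition("/")
    if name != "HTTP":
        raise ValueError(f"not an HTTP request line: {line!r}")
    return method, path, version

print(parse_request_line("GET /?page=123 HTTP/1.1\r\n"))
# ('GET', '/?page=123', '1.1')
```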

The Response line

The response line will look like this:

HTTP/<version> <response code> CRLF

Here’s how that breaks down:

  • version – The version of HTTP.  See above.
  • response code – This indicates whether the server successfully found the requested page, if there was an error, or if the client needs to be redirected.  If it found the page, it will return 200 OK.  Otherwise, it will return some other code like the infamous 404 Not Found or a 302 Found if there is a redirect to be done.
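As the examples above suggest, the code is a three-digit number followed by a short piece of text, and the first digit tells you the general category. Here is a sketch of reading a response line that way (both helper names are made up for this post):

```python
def parse_status_line(line):
    """Split a response line into (version, code, reason)."""
    # Split on the first two spaces only, since the reason text may contain spaces.
    protocol, code, reason = line.rstrip("\r\n").split(" ", 2)
    version = protocol.partition("/")[2]
    return version, int(code), reason

def classify(code):
    """Map a response code to its general category based on the first digit."""
    categories = {1: "informational", 2: "success", 3: "redirect",
                  4: "client error", 5: "server error"}
    return categories.get(code // 100, "unknown")

version, code, reason = parse_status_line("HTTP/1.1 404 Not Found\r\n")
print(version, code, reason, "->", classify(code))
# 1.1 404 Not Found -> client error
```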

 

HTTP Headers

An HTTP header will usually be of this form:

<header name>:  <header value> CRLF

Headers are basically just “metadata” about the request or response.  They include information about the encoding of the data, the browser requesting the page, and the server returning the page.  HTTP was designed to be extensible, so you will frequently run into headers that aren’t specified in the original RFC.
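A simplified sketch of turning a block of headers into a dictionary follows. Lowercasing the names is my own shortcut for making lookups work regardless of how the other side capitalized them, and this version keeps only the last value if a header is repeated.

```python
def parse_headers(block):
    """Parse 'Name: value' lines into a dict keyed by lowercased header name."""
    headers = {}
    for line in block.split("\r\n"):
        if not line:
            break  # an empty line marks the end of the headers
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers

raw = (
    "Server: nginx\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "X-Pingback: https://jasonmbaker.wordpress.com/xmlrpc.php\r\n"
    "\r\n"
)
headers = parse_headers(raw)
print(headers["content-type"])  # text/html; charset=UTF-8
print(headers["x-pingback"])    # https://jasonmbaker.wordpress.com/xmlrpc.php
```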

Form Data

Sometimes a server will need additional data from the client to return a webpage.  There are two ways to send it:  in the query string and in the body of the request.

The query string

In the case of HTTP GET and a couple of other HTTP methods, this data will be passed through the query string.  This request will look something like this:

GET /?page=123 HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF
CRLF
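You rarely need to build a query string by hand. As a sketch, Python's urllib.parse can encode and decode one; the q parameter here is a made-up second field to show how several values and special characters are handled.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Encode form fields into a query string (spaces become "+").
query = urlencode({"page": 123, "q": "http basics"})
print(query)  # page=123&q=http+basics

# This is what would follow GET in the request line.
target = "/?" + query

# On the server side, decode it again. Each value is a list because
# the same name may appear more than once in a query string.
print(parse_qs(urlsplit(target).query))
# {'page': ['123'], 'q': ['http basics']}
```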

The body

HTTP POST requests and most responses will pass data through the body.  An HTTP POST request will look something like this:

POST / HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Content-Type: application/x-www-form-urlencoded CRLF
Content-Length: 8 CRLF
Connection: close CRLF
CRLF
page=123

Notice that there are two CRLFs between the HTTP headers and the body.
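The sketch below assembles a similar POST in Python. It computes a Content-Length header from the body, which is how the server knows where the body ends, and then splits the message back apart at the empty line. Treat it as an illustration rather than how any particular browser does it.

```python
from urllib.parse import urlencode

# Form data is encoded the same way as a query string, but travels in the body.
body = urlencode({"page": 123}).encode("ascii")

head = "\r\n".join([
    "POST / HTTP/1.1",
    "Host: jasonmbaker.wordpress.com",
    "Content-Type: application/x-www-form-urlencoded",
    f"Content-Length: {len(body)}",  # length of the body in bytes
    "Connection: close",
])

# The last header's CRLF plus an empty line separate the headers from the body.
request = head.encode("ascii") + b"\r\n\r\n" + body

# A receiver can find the body by splitting at the first empty line.
headers_part, _, body_part = request.partition(b"\r\n\r\n")
print(body_part)  # b'page=123'
```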

Gotchas

Here are some of the things that will cause problems if you deal with HTTP often enough:

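  1. HTTP is selectively case sensitive.  Essentially, HTTP header names are not case sensitive, while method names and (in general) paths are.  This means that a server has to be prepared to treat CONTENT-ENCODING, content-encoding, and cOnTeNt-EnCoDiNg exactly the same, but it is free to treat /Index.html and /index.html as different pages.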
  2. Slashes on the end DO matter.  For example, http://www.google.com/index.html and http://www.google.com/index.html/ are different URIs.  Unless you’re trying to be tricky, you usually want to make these point to the same thing.
  3. The www matters.  For example, http://www.google.com and http://google.com are not only different URIs, they might even point to different servers.  Usually, people expect these to be the same.
  4. Path handling is harder than it looks.  For example, what happens if I want to join “/2009/05” and “10” to make “/2009/05/10”?  I can’t just concatenate those two strings together because then I would get “/2009/0510”.  Nor can I arbitrarily append slashes because then I could end up with something like “/2009/05//10” if I’m not careful.
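For the path-joining gotcha, here is one possible approach in Python: strip the slashes at the seam and put exactly one back. The join_path helper is hypothetical and ignores plenty of edge cases (query strings and “..” segments, for example). The urljoin calls from the standard library show that the trailing slash gotcha applies there too.

```python
from urllib.parse import urljoin

def join_path(base, segment):
    """Join two path pieces with exactly one slash between them."""
    return base.rstrip("/") + "/" + segment.lstrip("/")

print(join_path("/2009/05", "10"))    # /2009/05/10
print(join_path("/2009/05/", "/10"))  # /2009/05/10

# urljoin resolves a relative reference the way a browser would, so a
# missing trailing slash means the last segment of the base is replaced.
print(urljoin("http://example.com/2009/05", "10"))   # http://example.com/2009/10
print(urljoin("http://example.com/2009/05/", "10"))  # http://example.com/2009/05/10
```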

Conclusion

So, you probably know more about HTTP than you ever wanted to know.  For what it’s worth, HTTP is a bit of an antiquated protocol with a lot of “historical” features.  But it does the job it was intended to do and it does it well.

If you find any inaccuracies, please post them in the comments.  But bear in mind that I intended this to be for a broad audience, so there might be a few points that I oversimplified for the sake of brevity.  If you want to fill in the holes, there’s not really any other place to look than the HTTP specifications (RFC 1945 for HTTP 1.0 and RFC 2616 for HTTP 1.1).  If you’re new to HTTP, I’d highly recommend looking at the HTTP 1.0 specification first as it’s about a third as complex as HTTP 1.1.

Posted in Networking, Programming

URI vs URL

Posted by Jason Baker on May 14, 2008

Ok, so this is one issue that’s been bugging me about HTTP. I keep hearing the acronyms URI and URL mentioned. I knew that URL wasn’t technically accurate, but I couldn’t ever find a good explanation of what the difference between the two is or why URI is more technically accurate. This is even after reading various explanations about the subject. Here’s what I’ve come up with:

URI

A URI is a name that identifies something globally. Admittedly, this explanation is a little bit vague, but then again the idea of a URI is kind of a vague concept. We’ll come back to this later, but I’ll give you a few examples of URIs:

  • http://www.coderspalace.com
  • http://www.coderspalace.com/index.php
  • file:///usr/lib/python

URL

A URL is a special kind of URI. It gives you more precise instructions on where something is located. Thus, something like http://www.coderspalace.com/j_baker/ will tell you what computer a webpage is on and will even narrow down where the webpage is located, but it won’t give you an exact location of the file like http://www.coderspalace.com/j_baker/index.php will.

So I suppose the next question is: who cares? The point is that nowadays you don’t really need the level of precision that a URL requires and haven’t for a long time. Try going to http://www.coderspalace.com/j_baker/ and http://www.coderspalace.com/j_baker/index.php and see if you get any difference between the two links. You won’t. This is because my webserver is smart enough to know that when you go to http://www.coderspalace.com/j_baker/ you really mean http://www.coderspalace.com/j_baker/index.php. Pretty cool, huh?

Posted in Networking