Jason in a Nutshell

All about programming and whatever else comes to mind

Posts Tagged ‘HTTP’

HTTP and you

Posted by Jason Baker on May 10, 2009

I was kind of surprised by the number of people told me that they weren’t aware of the differences between HTTP POST and HTTP GET that my last post highlighted.  Not everyone who does web design and/or development has had a formal education on this kind of thing, so I’d like to focus a little bit more on the basics of HTTP.  A full summary of the HTTP protocol would take a couple hundred pages (or 175 to be exact).

In a lot of ways, doing web development and/or design without knowing how this stuff works is a bit like doing Calculus without knowing how addition and subtraction work.  True, you probably won’t ever need it.  But you would be surprised at how many questions can be answered by having a basic understanding of HTTP.

Anatomy of a URI

 

As this helpful diagram of the URI shows, there are 5 basic parts:

  • scheme – This is the protocol that we’re using to access whatever this UR – I represents.  For obvious reasons, we’re only interested in http schemes.
  • username/password – This isn’t really used much in the context of HTTP, but it should be pretty self explanatory.
  • hostname – This essentially tells us what computer we’re accessing.  This can be either an IP address (ex: 209.85.171.100 if you’re using IPv4) or a domain name (google.com).
  • port – This is the port number on the server we’re pulling data from.  In the context of HTTP this will usually be port 80, but occasionally it will be something different.  Also bear in mind that this may be different depending on the scheme (for example, FTP will be port 21 by default).
  • path – This represents where the website “lives” on the server.  It was largely designed for representing files and directories on a file system, but it’s worth mentioning that this part is ultimately little more than arbitrary text that may be interpreted by the server however it wishes.

Anatomy of an HTTP request

When you access my blog via HTTP, your browser sends an HTTP request that looks something like this:

GET / HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF

Your browser will receive a response that looks something like this (bonus:  there’s one header I left out.  Can you guess which one it is?  I hear there might be job offers if you can figure it out.):

HTTP/1.1 200 OK CRLF 
Server: nginx CRLF
Date: Sun, 10 May 2009 23:16:28 GMT CRLF
Content-Type: text/html; charset=UTF-8 CRLF
Transfer-Encoding: chunked CRLF
Connection: close CRLF
Vary: Cookie CRLF
X-Pingback: https://jasonmbaker.wordpress.com/xmlrpc.php CRLF
CRLF
<!DOCTYPE·html·PUBLIC·"-//W3C//DTD·XHTML·1.0·Transitional//EN"·"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...HTML goes here...

There are two important parts here:  the request/response line and the headers.  Just in case you’re wondering, the CRLF is a special kind of newline.  

The Request line

The request line will usually be in this general form:

<method> <path> HTTP/<version> CRLF

There are three parts to be concerned with :

  • method – the HTTP method we’re using.  A full discussion of all of these methods would be rather lengthy.  The vast majority of webpages are requested using HTTP GET or POST.  I have a full discussion of the differences between these two methods here.
  • path – this is the path to the page we’re requesting. Usually, this is only the path part of the URI and nothing more.  There’s a simple reason for this.  By the time your web browser has connected to my blog, the server presumably already knows that it’s at jasonmbaker.wordpress.com.  Since this isn’t always the case though, this is passed either in the Host header or sometimes in the path depending on circumstances.
  • version – the version of HTTP we’re using.  Usually this will be HTTP 1.0 or 1.1, but you will sometimes run into antiquated HTTP 0.9 clients and servers.

The Response line

The response line will look like this:

HTTP/<version> <response code> CRLF

Here’s how that breaks down:

  • version – The version of HTTP.  See above.
  • response code – This indicates whether the server successfully found the requested page, if there was an error, or if the client needs to be redirected.  If it found the page, it will return 200 OK.  Otherwise, it will return some other code like the infamous 404 Not Found or a 302 Found if there is a redirect to be done.

 

HTTP Headers

An HTTP header will usually be of this form:

<header name>:  <header value> CRLF

Headers are basically just “metadata” about the request.  They include information about the encoding of the data, the browser requesting the page, and the server returning the page.  HTTP was designed to be extensible, so you will frequently run into headers that aren’t specified in the original RFC.

Form Data

Sometimes webpages will require additional data to return a webpage.  There are two ways to do this:  in the query string and in the body of the request.

The query string

In the case of HTTP GET and a couple of other HTTP methods, this data will be passed through the query string.  This request will look something like this:

GET /?page=123 HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF

The body

HTTP POST requests and all responses will pass data through the body.  An HTTP POST request will look something like this:

POST / HTTP/1.1 CRLF
Host: jasonmbaker.wordpress.com CRLF
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1 CRLF
Connection: close CRLF
CRLF
page=123

Notice that there are two CRLFs between the HTTP headers and the body.

Gotchas

Here are some of the things that will cause problems if you deal with HTTP often enough:

  1. HTTP is selectively case sensitive.  Essentially, HTTP header names are not case sensitive.  This means that a server has to be prepared to treat CONTENT-ENCODING, content-encoding, and cOnTeNt-EnCoDiNg exactly the same.
  2. Slashes on the end DO matter.  For example, http://www.google.com/index.html and http://www.google.com/index.html/ are different URIs.  Unless you’re trying to be tricky, you usually want to make these point to the same thing.
  3. The www matters.  For example, http://www.google.com and http://google.com are not only different URIs, they might even point to different servers.  Usually, people expect these to be the same.
  4. Path handling is harder than it looks.  For example, what happens if I want to join “/2009/05” and “10” to make “/2009/05/10”?  I can’t just concatenate those two strings together because then I would get “2009/0510.”  Nor can I arbitrarily append slashes because then I could end up with something like “/2009/05//10” if I’m not careful.

Conclusion

So, you probably know more about HTTP than you ever wanted to know.  For what it’s worth, HTTP is a bit of an antiquated protocol with a lot of “historical” features.  But it does the job it was intended to do and it does it well.

If you find any inaccuracies, please post them in the comments.  But bear in mind that I intended this to be for a broad audience, so there might be a few points that I oversimplified for the sake of simplicity.  If you want to fill in the holes, there’s not really any other place to look than the HTTP specifications (RFC 1945 for HTTP 1.0 and  RFC 2616 for HTTP 1.1).  If you’re new to HTTP, I’d highly recommend looking at the HTTP 1.0 specification first as it’s about a third as complex as HTTP 1.1.

Posted in Networking, Programming | Tagged: , , | 3 Comments »

Make sure you use the right method

Posted by Jason Baker on April 29, 2009

Pop quiz:

What is the difference between HTTP POST and HTTP GET?

If your guess is that GET sends data via the query string while POST send data in the request’s body, you are… absolutely correct.  And I hate your guts.  Why is this?  It’s not that what you’re saying is wrong by any stretch of the imagination.

Idem-what?

I hate you because while GET and POST do send data differently, that’s not the biggest difference between the two.  The biggest difference is that POST methods are idempotent.  What does this piece of programmer-ese mean?  How many times have you seen something similar to this Chrome form?

 

confirm form resubmission

confirm form resubmission

On poorly designed websites, you probably see it a lot.  Web development newbies will often ask how to disable this page (or its Firefox equivalent).  There’s a really simple answer for this:  you don’t.  That warning is there for a reason.  You will see that warning every time you hit the back button to go to a page that was requested via HTTP POST.

The reason for this is that POST requests aren’t idempotent.  Idempotent is just a fancy word meaning that something won’t make changes.  As the HTTP 1.1 specification says:

In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered “safe”. This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.

Suppose we’re writing software for a web forum, and we have a URI /forum/submit-message.  If we send data to this form, it will post a message.  Now, imagine that after submitting a message, the user visits /forum/view-message and then hits the back button.  They will go back to /forum/submit-message and send the same message over again if you were to use HTTP GET instead of POST.  However, if you were to use POST, that would be prevented by the maligned “Confirm Form Resubmission” page.  Sometimes annoying things can be useful, eh?

What’s the point?

The point is that nothing is more annoying than a website that uses GET and POST incorrectly.  And you don’t want to lose users by using the wrong one, do you?

Posted in Programming | Tagged: , , , , | Leave a Comment »

URI vs URL

Posted by Jason Baker on May 14, 2008

Ok, so this is one issue that’s been bugging me about HTTP. I keep hearing the acronyms URI and URL mentioned. I knew that URL wasn’t technically accurate, but I couldn’t ever find a good explanation of what the difference between the two are or why URI is more technically accurate. This is even after reading various explanations about the subject. Here’s what I’ve come up with:

URI

A URI is a name that identifies something globally. Admittedly, this explanation is a little bit vague, but then again the idea of a URI is kind of a vague concept. We’ll come back to this later, but I’ll give you a few examples of URIs:

  • http://www.coderspalace.com
  • http://www.coderspalace.com/index.php
  • file://usr/lib/python

URL

A URL is a special kind of URI. It gives you more precise instructions on where something is located. Thus, something like http://www.coderspalace.com/j_baker/ will tell you what computer a webpage is and will even narrow down where the webpage is located, but it won’t give you an exact location of the file like http://www.coderspalace.com/j_baker/index.php will.

So I suppose the next question is: who cares? The point is that nowadays you don’t really need the level of precision that a URL requires and haven’t for a long time. Try going to http://www.coderspalace.com/j_baker/ and http://www.coderspalace.com/j_baker/index.php and see if you get any difference between the two links. You won’t. This is because my webserver is smart enough to know that when you go to http://www.coderspalace.com/j_baker/ you really mean http://www.coderspalace.com/j_baker/index.php. Pretty cool, huh?

Posted in Networking | Tagged: , , , | Leave a Comment »