PHP - Form handling

From LXF Wiki

Form handling

(Original version written by Paul Hudson for LXF issue 65.)

We get back to PHP basics, but show that using HTML forms isn't as straightforward as you might think...


Once you start programming PHP, you start to feel like your time on HTML was a wasted youth. As it is interpreted client-side, HTML gets mangled in any number of different ways, most of which are inevitably not what you wanted. As a result, you can easily pass hours and even days in a struggle between standards and their various interpretations in web browsers.

PHP is a breath of fresh air: while HTML "programmers" use many standards before they die, PHP programmers taste of standards but once. That said, we're going to go back to HTML with a positive attitude: we plan to use it as little as possible, after all.


Form fundamentals

HTML is a very simple way of representing web pages. It is stateless, typeless, and, for anything beyond the absolute basics, useless. However, for forms it works well, and we need only focus on "stateless" and "typeless" - stateless means that one HTML page cannot remember any saved settings or states from another HTML page (or even itself), and typeless means that all HTML data is treated as text, regardless of what users enter in.

Having a typeless system makes data validation slightly trickier. For example, if we want to allow users to enter their age into a form, we want to ensure they don't enter "elephant" as their age. So, the check is simple: we use a function like PHP's is_int(), right? Wrong. HTML data is always treated as text, which means that even an age of 49 would be considered a string, and thus fail the is_int() test. There is a solution: PHP has a companion function called is_numeric(), which returns true if a variable is a number or a string masquerading as a number. Note that this includes decimals such as "4.5", so on its own it does not guarantee a whole number.
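To see the difference in practice, here is a minimal sketch that runs both functions over values as they would arrive from a form. The helper name describe_value() and the sample values are made up for illustration.

```php
<?php
// Hypothetical helper: reports what is_int() and is_numeric() say about a value.
function describe_value($value)
{
    return array(
        'is_int'     => is_int($value),     // checks the variable's type only
        'is_numeric' => is_numeric($value), // inspects the contents as well
    );
}

// Form data always arrives as strings, so these mimic submitted values.
foreach (array("49", "elephant", "4.5", "") as $sample) {
    $result = describe_value($sample);
    echo '"', $sample, '": is_int=', var_export($result['is_int'], true),
         ', is_numeric=', var_export($result['is_numeric'], true), "\n";
}
```

Every string fails is_int(), while is_numeric() accepts "49" and "4.5" but rejects "elephant" and the empty string.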

Let's take a look at the first script example: we want a form that accepts a user's name and age, then checks that the age is numeric and between 18 and 30:

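<?php
   // $_POST always exists, so test for the field itself
   if (isset($_POST['Age'])) {
      $age = $_POST['Age'];
      // check the value is numeric before comparing it as a number
      if (is_numeric($age) && ($age >= 18) && ($age <= 30)) {
         // test successful...
      } else {
         echo "Sorry, no Club 18-30 for you!<br />";
      }
   }
?>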

<form method="post">
Name: <input type="text" name="Name" /><br />
Age: <input type="text" name="Age" /><br />
<input type="submit">
</form>
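If the age must be a whole number, one stricter option is PHP's filter_var() with FILTER_VALIDATE_INT, which can check the range in the same call. The sketch below is only one way to do it: the helper name parse_age() is hypothetical, and it assumes that 18 and 30 themselves count as valid.

```php
<?php
// Hypothetical helper: returns the age as an int, or null if the input
// is not a whole number inside the allowed range (bounds included).
function parse_age($raw, $min = 18, $max = 30)
{
    $age = filter_var($raw, FILTER_VALIDATE_INT, array(
        'options' => array('min_range' => $min, 'max_range' => $max),
    ));

    // filter_var() returns false on failure; turn that into null
    return ($age === false) ? null : $age;
}

var_dump(parse_age("25"));       // a valid age comes back as an integer
var_dump(parse_age("elephant")); // anything else comes back as null
```

Unlike is_numeric(), this rejects values such as "4.5", and a successful result is a real integer rather than a string.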

Of course, that's all very well, but there aren't many forms that can be summed up in just two fields. Writing validation well adds a fair amount of bulk to your code, so what we should really be aiming for is a generic, re-usable validation system that doesn't increase in size if the form is larger.


Array of light

We can fulfil our plan by sending the form as a single array, which you may not have done before. Arrays in HTML forms are defined in the same way as PHP arrays: with the [ and ] symbols. We can rewrite the name/age form with an array like this:

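<?php
	// only run the check once the form array has been submitted
	if (isset($_POST['form']['Age'])) {
		$form = $_POST['form'];
		if (is_numeric($form['Age']) && ($form['Age'] >= 18) && ($form['Age'] <= 30)) {
			// test successful...
		} else {
			echo "Sorry, no Club 18-30 for you!<br />";
		}
	}
?>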

<form method="post">
Name: <input type="text" name="form[Name]" /><br />
Age: <input type="text" name="form[Age]" /><br />
<input type="submit">
</form>
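To see what the bracket syntax turns into, you can feed a sample request body to parse_str(), which applies the same naming rules PHP uses when it builds $_POST. The values "Bob" and "25" are made up, and the brackets are left unencoded for readability (a browser would normally percent-encode them).

```php
<?php
// Roughly what the browser sends for the form above (sample values)
$body = 'form[Name]=Bob&form[Age]=25';

// parse_str() fills $fields the way PHP fills $_POST
parse_str($body, $fields);

// $fields now holds one element, "form", which is itself an array
print_r($fields);
```

Both values arrive as strings inside a single "form" array, which is what lets us loop over the whole form in one go.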

We can take that a step further by prefixing each of the form variables with specific letters, eg "r" for "required", "i" for "treat as integer", "s" for "treat as string", etc. We can then put an underscore between the prefix and the variable name, then go through each character in the prefix and ensure the value provided matches the requirements. Sound complicated? It's not, and it's actually very flexible. Here is the - admittedly long - code:

<?php
   if (isset($_POST['form']) && is_array($_POST['form'])) {
      $errors = array();
      foreach($_POST['form'] as $var => $val) {
         // split the variable name
         $name = explode("_", $var);
         if (count($name) == 2) {
            // this has a prefix
            $parseblock = $name[0];
            $varname = $name[1];
            $parselen = strlen($parseblock);
            for ($i = 0; $i < $parselen; ++$i) {
               // loop through each letter in the prefix
               switch ($parseblock[$i]) {
                  case "r":
                     // field is required
                     if (!strlen($val)) {
                        $errors[] = "<strong>$varname</strong> is a required field";
                        break 2;
                     }
                     break;
                  case "i":
                     // field should be int
                     if (!is_numeric($val)) {
                        $errors[] = "<strong>$varname</strong> cannot be set to <em>" . htmlspecialchars(is_string($val) ? $val : '') . "</em>";
                        break 2;
                     }
                     break;
                  case "s":
                     // field should be string (text inputs always arrive as strings, so this only catches nested arrays)
                     if (!is_string($val)) {
                        $errors[] = "<strong>$varname</strong> must be plain text";
                        break 2;
                     }
                     break;
               }
            }
         } else {
            // this has no prefix
            $varname = $name[0];
         }
      }
      if (count($errors)) {
         // there were validation errors!
         foreach($errors as $val) {
            echo "$val<br />";
         }
      }
   }
?>
<form method="post">
Name: <input type="text" name="form[rs_Name]" /><br />
Age: <input type="text" name="form[ri_Age]" /><br />
Sex: <input type="text" name="form[Sex]" /><br />
<input type="submit" value="Go" />
</form>

The comments in the PHP code explain what it's up to, but note that the HTML form shows off two required fields, one string-only field, one numbers-only field, and one (Sex) that has no prefix at all and also works fine.
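To reuse the idea across several forms, the loop can be wrapped in a function that takes the submitted array and returns the error messages. This is a cut-down sketch rather than a drop-in replacement: the name validate_prefixed() is hypothetical, it handles only the "r" and "i" flags, and it splits on the first underscore only, so a field called "r_first_name" would still count as prefixed.

```php
<?php
// Hypothetical wrapper around the prefix idea: takes the submitted
// form array and returns a list of plain-text error messages.
function validate_prefixed(array $form)
{
    $errors = array();
    foreach ($form as $var => $val) {
        // split on the first underscore only
        $parts = explode('_', $var, 2);
        if (count($parts) < 2) {
            continue; // no prefix, nothing to check
        }
        list($prefix, $name) = $parts;

        // stop at the first flag that fails for this field
        foreach (str_split($prefix) as $flag) {
            if ($flag === 'r' && $val === '') {
                $errors[] = "$name is a required field";
                break;
            }
            if ($flag === 'i' && !is_numeric($val)) {
                $errors[] = "$name must be a number";
                break;
            }
        }
    }
    return $errors;
}

// Sample submission: empty name, bad age, optional field left blank
$errors = validate_prefixed(array(
    'rs_Name' => '',
    'ri_Age'  => 'elephant',
    'Sex'     => '',
));
print_r($errors);
```

The messages are returned as plain text so that the caller can decide how to escape and display them.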


Power validation

There's a lot more to data validation than just is_int() and is_numeric(), and it's all neatly encapsulated into the Ctype family of functions. These functions have been bundled and auto-compiled by default since PHP 4.2, but if you're stuck on an older version (upgrade, already!) then you should use --enable-ctype on your configure line.

The nearest Ctype equivalent to is_int() and is_numeric() is ctype_digit(), which accepts only the digits 0-9, and there is also ctype_alnum(), with "alnum" simply being an abbreviation for "alphanumeric". However, it goes a long way beyond that - you get ctype_alpha(), which checks exclusively for alphabetic characters, ctype_cntrl(), which checks for control characters (eg line breaks), ctype_lower(), which checks only for lowercase characters, ctype_punct(), which checks for characters like ! and ?, and ctype_space(), which checks for whitespace. Each of these returns true only if the string provided as its only parameter is made up entirely of the class of characters it is checking for.
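A quick way to get a feel for these is to run a few of them over sample strings. The strings below are chosen purely for illustration, and the sketch assumes the Ctype extension is enabled.

```php
<?php
// Each call inspects every character of the string it is given.
$checks = array(
    'ctype_alpha on Hudson'     => ctype_alpha("Hudson"),  // letters only
    'ctype_alpha on LXF65'      => ctype_alpha("LXF65"),   // digits spoil it
    'ctype_lower on php'        => ctype_lower("php"),
    'ctype_punct on !?'         => ctype_punct("!?"),
    'ctype_space on space+tab'  => ctype_space(" \t"),
    'ctype_cntrl on line break' => ctype_cntrl("\n"),
    'ctype_digit on 12345'      => ctype_digit("12345"),
    'ctype_digit on 4.5'        => ctype_digit("4.5"),     // the dot is not a digit
);

foreach ($checks as $label => $result) {
    echo $label, ': ', var_export($result, true), "\n";
}
```

Note that ctype_digit() rejects "4.5", which is_numeric() would accept, so it is the stricter choice when you want whole numbers.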

Despite being more powerful than the regular PHP functions, the Ctype family are actually about twice as fast when the value needs to be inspected. For example, is_int() works simply by returning the internal data type of the variable, without bothering to check its value, and so is faster than its Ctype equivalent. However, is_numeric() checks the contents of the variable to ensure it is numeric, and so runs slower than Ctype.

Here are some example uses of the Ctype functions:

<?php
  $string1 = "Hello, world";
  $string2 = "l23456";

  echo "First: ", ctype_alnum($string1), "\n";
  echo "Second: ", ctype_print($string2), "\n";
  echo "Third: ", ctype_digit($string2), "\n";
?>

Before you continue, have a think about what that script will output - it's not as easy as it first looks. What you'll get is this:

First:
Second: 1
Third:

So, what we're seeing there is the first and third checks returning nothing (ie, they returned "false"), with only the second check returning true. The first check fails because "Hello, world" contains a space and a comma - neither of which is an alphanumeric character. The third check fails because $string2 is set to l23456 - the first character there is a lowercase L as opposed to a 1. Sneaky, yes, but Ctype spotted a character difference that most humans would have missed.


Visual aids

Filling out a form is a daunting task for the majority of people, and it's your job to minimize that. If you're designing an online shop, take a good look at how Amazon.com does it - they have invested millions in user interface design and it really shows through the fact that they make every effort to re-assure users that they are in control.

For example, if you have your form split across several pages you should let people know how far into the process they are and how much further remains. Even better would be to allow users to click on a previous step and return to it so they can make changes. A knock-on change to this is to make it clear when payment happens, so that people don't think "I don't want to click Next because I haven't specified the seat I want/paint colour/etc".

Another smart move is to make it clear which fields are required, either by adding a star, red text, or, clearest of all, the word "required" after the fields. If readers don't provide a required field, don't just say "You missed out a field" - working on your site shouldn't be a guessing game. Instead, something like "field XYZ is required" is better.

Dealing with passwords is never easy, but you can take away some of the sting by asking users to enter their password twice, and also by placing HTML text length restrictions on the entry box. If your database only accepts passwords of 10 characters and under, why let people enter 12? When entering a 12-character password, users will think the DB has stored the full length - then wonder why they can't log in later.

The solution is to use a fixed-length box so that they know they are being forced to fit into a certain length. Keep in mind that most password entry boxes use asterisks rather than show the text, which means that if the actual width of the password box isn't sufficient to see all the asterisks in their password at the same time, they probably won't realise that their typing is being ignored.
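A length limit in the HTML only restrains well-behaved browsers, so it is worth repeating the same checks on the server. Here is a minimal sketch, assuming the 10-character limit from the example above; the function name check_password_pair() and its messages are hypothetical.

```php
<?php
// Hypothetical helper: checks a password that was typed twice and
// returns a list of problems (an empty list means it is acceptable).
function check_password_pair($first, $second, $maxLength = 10)
{
    $errors = array();

    if ($first === '') {
        $errors[] = 'Password is required';
    } elseif (strlen($first) > $maxLength) {
        // strlen() counts bytes, which matches characters for plain ASCII
        $errors[] = "Password must be $maxLength characters or fewer";
    }

    if ($first !== $second) {
        $errors[] = 'The two passwords do not match';
    }

    return $errors;
}

print_r(check_password_pair('abcdefghijkl', 'abcdefghijkl'));
```

Rejecting an over-long password outright, rather than silently cutting it down, means users find out about the limit while they can still do something about it.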

Anyone with a little experience designing web forms will know they inevitably don't fit into the layout of the rest of the site. The rule of thumb here is to be sacrificers, but not butchers - with work it's possible to make even the most complex form look good and fit into your site design; don't be afraid to break them up across pages as necessary.


No file fooling

Handling file uploads is not a simple task. Yes, the code to do it is easy enough, but it is easy to make a mistake in your code and render your site open to attack. The problem is that allowing just anybody to upload files to your site means that malicious users could easily send you hundreds of megabytes of data - chewing up your bandwidth, disk space, and even your money if you pay for your bandwidth by the gigabyte.

Here's the most basic file upload script:

<?php
  if (!empty($_FILES)) {
    var_dump($_FILES);
  }
?>

<form enctype="multipart/form-data" method="post">
Send file: <input type="file" name="firstfile" /><br />
<input type="submit" />
</form>

Note that there's now an "enctype" attribute for the form: without that, the file will not upload correctly and the script won't work. The file input has an arbitrary name, "firstfile", and when the form gets submitted we output the contents of the $_FILES superglobal array. Nothing really happens there, but the output is worth looking over closely as it shows you exactly what information you get about each file:

array(1) {
  ["firstfile"]=> array(5) {
    ["name"]=> string(10) "myfile.txt"
    ["type"]=> string(10) "text/plain"
    ["tmp_name"]=> string(18) "/var/tmp/php44.tmp"
    ["error"]=> int(0)
    ["size"]=> int(237274)
  }
}

So, the $_FILES array contains just one element, "firstfile", which itself is an array that contains five other elements: name, type, tmp_name, error, and size. The "name" entry is what the file was originally called on the user's system, before upload, and "type" is the MIME type for the file as reported by the browser, so don't rely on it. The "tmp_name" element is where the file has been uploaded to on your server, and you can set the default location with the upload_tmp_dir setting in your php.ini file. Moving on, "error" is set to 0 (UPLOAD_ERR_OK) if the upload went OK, or to one of the other UPLOAD_ERR_* codes if it didn't, and "size" is the size of the file that was uploaded, in bytes - divide by 1024 to get kilobytes.
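The "error" and "size" elements are worth checking before you touch the file. The sketch below shows one way to turn them into a message; the function name upload_problem() and the 1MB default limit are assumptions for illustration, while the UPLOAD_ERR_* constants are PHP's own.

```php
<?php
// Hypothetical helper: returns null if the upload looks usable,
// or a short message describing what went wrong.
function upload_problem(array $file, $maxBytes = 1048576)
{
    switch ($file['error']) {
        case UPLOAD_ERR_OK:
            break;
        case UPLOAD_ERR_INI_SIZE:
        case UPLOAD_ERR_FORM_SIZE:
            return 'The file is too large';
        case UPLOAD_ERR_PARTIAL:
            return 'The file was only partially uploaded';
        case UPLOAD_ERR_NO_FILE:
            return 'No file was sent';
        default:
            return 'The upload failed (error code ' . $file['error'] . ')';
    }

    // our own limit, checked against the size PHP measured
    if ($file['size'] > $maxBytes) {
        return 'The file is larger than ' . ($maxBytes / 1024) . ' KB';
    }

    return null;
}

// Values taken from the var_dump() output above
var_dump(upload_problem(array('error' => 0, 'size' => 237274)));
```

You would call it once per entry in $_FILES and only go on to handle the file if it returns null.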

The crucial parts in there are name and tmp_name: what the file was called before upload, and what the file is called on your server. The next step is to move the file from its temporary location to a more permanent place, and the best way to accomplish that is with the move_uploaded_file() function. As the name implies, it only moves files that were legitimately uploaded, which means that it's useful /and/ secure - it will ignore any files that were not uploaded through a PHP script, which avoids copycat attacks. To use this function, give it the filename to move as its first parameter, and the new location as its second, like this:

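if (!empty($_FILES)) {
  foreach($_FILES as $file) {
    // skip anything that did not upload cleanly
    if ($file['error'] !== UPLOAD_ERR_OK) {
      continue;
    }
    // basename() stops a crafted name such as "../index.php" escaping the uploads directory
    move_uploaded_file($file['tmp_name'], '/var/www/uploads/' . basename($file['name']));
  }
}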

Of course, that script has a fatal flaw - can you spot it? The problem is that we use the input filename as part of the final filename, which means two people who upload a file called "myfile.txt" will collide - the new file will overwrite the old file. This isn't a desirable situation, so the best solution here is to place each file in a subtly different location - perhaps you could append a random number to each filename (maybe the time it was uploaded?), or if you want to preserve the exact original filename you can use random subdirectories. If you choose the latter option, make sure you keep a database around to save information about what files went where.
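As a rough sketch of the random-number approach, the helper below puts a random hex string in front of the original name rather than after it, so that the file extension survives. The name unique_upload_name() is hypothetical, and random_bytes() needs PHP 7 or later.

```php
<?php
// Hypothetical helper: builds a filename that is very unlikely to
// collide with an earlier upload of the same name.
function unique_upload_name($originalName)
{
    // basename() drops any directory part the client may have sent
    $safeName = basename($originalName);

    // 8 random bytes become 16 hex characters
    return bin2hex(random_bytes(8)) . '_' . $safeName;
}

echo unique_upload_name('myfile.txt'), "\n";
```

The result can be used as the last part of the destination path passed to move_uploaded_file().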

The other alternative is to namespace files by the username of the uploader, which would mean /var/www/uploads/bob/upload.txt, /var/www/uploads/tim/upload.txt, etc. If the same user then tries to upload a duplicate filename, you can prompt them "are you sure you want to overwrite the existing file?" Wherever you choose to place your uploaded files, make sure that Apache has permissions to read and write there, otherwise your plans will gang aft agley.
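The per-user layout could be sketched like this. The function name user_upload_path() is hypothetical, and stripping the username down to letters, digits, dashes and underscores is an assumption about what your usernames look like.

```php
<?php
// Hypothetical helper: builds the destination path for one user's upload.
function user_upload_path($baseDir, $username, $filename)
{
    // keep only characters that are safe in a directory name
    $userDir = preg_replace('/[^A-Za-z0-9_-]/', '', $username);

    // basename() drops any directory part of the uploaded name
    return rtrim($baseDir, '/') . '/' . $userDir . '/' . basename($filename);
}

$target = user_upload_path('/var/www/uploads', 'bob', 'upload.txt');

// if the file is already there, ask the user before overwriting it
$needsConfirmation = file_exists($target);

echo $target, "\n";
```

A username that ends up empty after filtering should be rejected by the caller, and the user's directory has to exist before the file is moved into it.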


The Pros and Cons of Client-side Validation

PHP's ability to validate input on the server should not be treated as the last word on data validation. Many forms have errors in them - not because people are stupid, but simply that they make typing errors or just don't understand what you're asking for.

If, having checked a form, you find it's missing a required field, you need to send it back to the reader for further input - they need to do more work, and then resubmit it for further validation. This means your machine needs to validate their form twice, if not more - hardly a valuable use of resources. A better solution is to add some client-side JavaScript that runs many or all of the same checks your server script will run. While this won't work on all browsers (some don't support JavaScript; others have it disabled), every little bit helps to reduce server load. The server-side checks still have to stay in place, though, because anything that runs in the browser can be switched off or bypassed.

The Ten Commandments of Data Validation

  • Thou shalt not assume users are trustworthy
  • Thou shalt mark required fields clearly
  • Thou shalt always prefill fields if forms must be re-entered
  • Thou shalt always check data is within a valid range
  • Thou shalt ask users to retype important fields, such as passwords
  • Thou shalt place size and length restrictions on input fields
  • Thou shalt never forget to check data is of the correct type
  • Thou shalt never accept data from an unknown referrer
  • Thou shalt never use HTTP GET for sensitive information
  • Thou shalt only allow trusted users to upload files
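As an example of the prefill commandment, a field can be written back out with the value the user typed last time. This is a minimal sketch: the helper name prefilled_input() is hypothetical, and htmlspecialchars() is there so that quotes or tags in the old value cannot break out of the attribute.

```php
<?php
// Hypothetical helper: builds a text input that already contains
// whatever the user entered on the previous attempt.
function prefilled_input($name, $value)
{
    return '<input type="text" name="' . htmlspecialchars($name, ENT_QUOTES)
         . '" value="' . htmlspecialchars($value, ENT_QUOTES) . '" />';
}

// Sample value containing characters that need escaping
echo prefilled_input('Name', 'Bob "the" <b>'), "\n";
```

On a failed submission you would pass in the matching element of $_POST, or an empty string on the first visit.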

New website!

At last, the LXF website is no longer the slowest site on the Internet - it has been tweaked in various places so that it's at least the second slowest now. That said, we are upgrading the server to something much, much faster (approximately a 16x speed boost) and there's a redesign in the works that we hope to see soon. So, "goodbye" to double posting, "so long" to random disconnections, and let's all hope it's also the end of the eye-scalding orange colouring...

Image: lxfsite.png (http://www.linuxformat.co.uk/images/wiki/lxfsite.png). It's still ugly, but it's at least a smidge faster - we're working on the colour, honest!

Validate this

The Darwin awards of data validation

Many men have tried to validate all the data that comes their way. Sadly, they tried and died - or at least pulled out large chunks of their hair. For example, although it might seem easy to validate URLs and email addresses, we strongly recommend you avoid it beyond the absolute basics. That is, a URL should have at least one full stop/period in it, and an email address should have a full stop and an @ sign. Beyond that lies madness, and dragons, too.

The problem with validation of these two types of data (and others like them) is that their input is ever-changing. For example, lots of people's websites break when new domain names get introduced like ".info", simply because they were checking for a TLD (top-level domain) that was three letters or under. Similarly, email addresses allow all sorts of characters that most people just don't use, like +, and yet lots of scripts will reject such addresses.
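The "absolute basics" described above amount to very little code. The sketch below implements only those two rules and deliberately nothing more; the function names are hypothetical.

```php
<?php
// Hypothetical helper: an email address needs an @ sign and a full stop.
function looks_like_email($value)
{
    return strpos($value, '@') !== false && strpos($value, '.') !== false;
}

// Hypothetical helper: a URL needs at least one full stop.
function looks_like_url($value)
{
    return strpos($value, '.') !== false;
}

var_dump(looks_like_email('paul+lxf@example.info')); // + and .info are fine
var_dump(looks_like_url('localhost'));               // no full stop
```

Checks this loose will let plenty of nonsense through, but they won't turn away a real address just because it uses a new TLD or an unusual character.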