Working with files

From LXF Wiki

Namespaces and object-oriented programming are boring, so Paul Hudson gets down to some real programming and makes something useful...

Did you ever watch ThunderCats when you were a kid? I thought it was great at the time - you really could "feel the magic, hear the roar" from your TV set, because the animation was cool, the characters were likeable, and the scripts were exciting. Looking back on it now, though, I realise it was more than a bit formulaic. The ThunderCats included people such as Lion-O, Tygra, Panthro and Cheetara, and were fighting against a monkey-like man called Monkian, a jackal-like man called Jackalman and a bird-like man called Vultureman. Their home world was Thundera. For fuel, they used Thundrillium. Panthro's car was called the ThunderClaw. Do you see the pattern yet?

This predictability might seem rather dull, but it does make it very easy for kids to grasp what's going on, as well as making it easy to remember when talking about it to their friends. Now that I'm all grown up, two things are clear. First, I'm not going to be an astronaut. I would have gone for it, but, you know, I don't think they let you in if you wear glasses. Second, the best way to learn something is to make it memorable. To give you an example, I've been using PHP for years now, and I still can't remember off the top of my head whether strpos() (the function to locate one string's position within another) takes its parameters as $needle, $haystack or as $haystack, $needle.

PHP's problem is that strpos() takes its parameters as $haystack, $needle, whereas in_array() (the function that tells you whether an item is in an array) takes its parameters as $needle, $haystack. Similarly, strpos() is all one word, whereas str_replace() has an underscore between the two parts. In essence, PHP is not a very memorable programming language. PHP programmers are never going to be ThunderCats.

C# - being substantially newer and less full of brainfarts than PHP - has none of these problems. In fact, some parts of C# are so straightforward that you can almost program them by guessing at method names. This issue we're going to be looking at just such an area: file manipulation. We're going to make a program that searches a filesystem and stores an index of each file that exists so that we can print out the contents of the files that match a search. Don't worry if this sounds hard - C#, Mono and .NET will be doing the hard work.


Feel the magic

Let's get started with some basic file reading and writing. Crank up MonoDevelop and create a new C# Console project (File > New Project > C# > Console Project). We're going to be using this project for our final project, so give it a name you wouldn't be embarrassed about if you saw it on SourceForge. I've chosen "Snarf", because it has the meaning "to take" (this program will read the content of lots of files) and also cunningly ties in with the ThunderCats theme.

We're going to be using some of the new .NET 2.0 functionality for this project, and most MonoDevelop versions ship using .NET 1.1 as standard. To change this, click Project > Options, then select Runtime Options from the Categories list of the window that appears. On the right you'll be able to choose between runtime version 1.1 or 2.0, so change it to 2.0. While you're in this window, open up the Configurations > Debug category, select Output, then change the "Output path" option so that it removes the /bin/Debug section. This will tell MonoDevelop to save your executable file in the root directory of your project. Click OK to save your changes.

File reading and writing is all handled with the System.IO library (short for input/output, ie reading/writing), which means you'll need to put "using System.IO;" at the top of your project's main code file. Change the "class MainClass" line to read "class Snarf", then delete the Console.WriteLine(), leaving the Main() method empty.

The first thing we're going to do is read the contents of a single file. File contents are - at least as far as we're interested - plain text, meaning that they can be stored easily in a string data type. Create a file called myfile.txt in the root directory of your project (where MonoDevelop will save your program), and enter some text in it.

Now, the real question is: how do we get the contents of a file into our string? Well, say it with me: "thunder, thunder, thunder, ThunderCats! HoooOOOooooo!":

string myfile = File.ReadAllText("myfile.txt");

That's it: that's all it takes to read the contents of a file into a string using Mono. To print it out, you can use the Console.Write() method we looked at last issue, like this:

Console.Write(myfile);

Hit F5 to compile and run the program, and you should see the contents of your file printed in the Application Output pane at the bottom of MonoDevelop.


Hello, Operator?

Let's push our program a bit further, and make it write some changes back to our file. Add these two lines after the Console.Write() call:

myfile += "\nMy wings are like a shield of... oh wait, wrong cartoon.";
File.WriteAllText("myfile.txt", myfile);

+= is an operator, which is a fancy word to describe a symbol that performs an operation. For example, you already know "+". In geekspeak, that's the addition operator, meaning that it takes two numbers (formally called "operands") and adds them together. Similarly, - is the subtraction operator, * is the multiplication operator and / is the division operator. = is the assignment operator, which means it takes one operand (let's say "10") and copies its value into another (let's say the variable "a"). So, "a = 10" sets the variable "a" to have the value "10".

Yes, I know this is all primary school stuff, but bear with me: here comes the pay off. If a is 10, how can you add another 10 to a? Here's what you probably thought of...

a = a + 10;

That works. In fact, according to Occam's Razor ("all things being equal, the simplest explanation is the best one"), that code is the best. But a little-known addition to that principle - known as Occam's /eraser/ - is that the simplest solution is bound to have something wrong with it. In this case, the problem is that the line of code takes 11 keypresses to type, when C# lets us do exactly the same thing in just 8:

a += 10;

+= is the illegitimate lovechild of + (add things together) and = (assign one thing to another), and adds whatever is on the right to the existing value of whatever is on the left. Snarf!

With your newfound knowledge, you should now be able to see that the new code for our program adds one string to whatever is in the previous string. C# is smart enough to know when += is used on numbers (ie, it should add the two numbers together) and when it's used on strings (ie, it should concatenate the two strings). We start our new string with \n, because that tells Mono to add a line return after whatever is in the text file already.
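
To see both flavours of += side by side, here's a minimal sketch. The class name PlusEquals and the two helper methods are made up for this illustration, and aren't part of the Snarf project.

```csharp
using System;

class PlusEquals
{
    // Hypothetical helper: += on a number adds.
    public static int AddTen(int a)
    {
        a += 10;   // same as: a = a + 10;
        return a;
    }

    // Hypothetical helper: += on a string concatenates.
    public static string Join(string left, string right)
    {
        left += right;   // same as: left = left + right;
        return left;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(AddTen(10));                // prints 20
        Console.WriteLine(Join("Thunder", "Cats"));   // prints ThunderCats
    }
}
```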

The WriteAllText() method is to ReadAllText() as Wilykit is to Wilykat: give it a filename as its first parameter and the text to write as its second parameter, and it will handle all the saving stuff for us.
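
Put together, a complete source file for this stage might look something like the sketch below. To keep it runnable anywhere it writes its own scratch file first, so the filename snarf-demo.txt and the AppendLine() helper are assumptions of mine rather than anything from the project on the disc.

```csharp
using System;
using System.IO;

class Snarf
{
    // Hypothetical helper: read the whole file, add a line to the
    // string with +=, then write the whole lot back again.
    public static void AppendLine(string path, string extra)
    {
        string contents = File.ReadAllText(path);
        contents += "\n" + extra;
        File.WriteAllText(path, contents);
    }

    public static void Main(string[] args)
    {
        // Scratch file (made-up name) created in the current directory.
        string path = "snarf-demo.txt";
        File.WriteAllText(path, "Thunder, thunder, thunder...");

        AppendLine(path, "ThunderCats! Ho!");
        Console.WriteLine(File.ReadAllText(path));
    }
}
```

Note that WriteAllText() replaces whatever was in the file before, which is why we read the old contents first and write everything back.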


All things being equal

If you use a single = when working with conditional statements, bad things happen. For example:

if (Name = "Snarf") {

That code actually means "assign "Snarf" to Name, and use the result as the condition". In C and some other languages, such assignments nearly always count as true, so the code inside the braces would nearly always run. C# is stricter: the result of that assignment is a string rather than true or false, so the compiler rejects it. The trap is still open when the variable is a bool, though: with a hypothetical bool variable called Ready, "if (Ready = true)" compiles and always runs. If you frequently type = when you really meant ==, you can protect yourself by flipping the operands. That is, these two statements do exactly the same thing:

if (Name == "Snarf") {
// or...
if ("Snarf" == Name) {

The difference is that if you type = rather than == in the second statement, MonoDevelop will always refuse to compile your program, whatever the type of the variable, because you can't assign a variable to a string - "Snarf" is always "Snarf", and can't be changed.


Directories, loops and conditionals - oh my!

It's time to kick the ThunderClaw up a gear and get involved with some serious programming constructs: we're going to look at conditional statements and loops. Conditional statements are designed to execute only when a certain condition (specified by us) is true. If the condition fails (if the sky isn't blue, if the user's age isn't 26, or whatever else we've told it to check) then the code doesn't execute.

For example:

string Name = "Cheetara";
if (Name == "Snarf") {
  Console.WriteLine("Snarf snarf!");
}


That code will print nothing out, because the Name variable doesn't contain "Snarf". Note that == is just another operator, and means "is equal to". This is quite different to plain old =, which means "make equal to" (see the box "All things being equal").

Loops allow us to execute a given block of code multiple times. For example:

Console.Write("Thunder... ");
Console.Write("Thunder... ");
Console.Write("Thunder... ");
Console.Write("ThunderCats! HO!\n");

The "Thunder" part is pretty repetitive, so we can encapsulate that in one type of loop, known as a "for" loop, like this:

for (int i = 1; i <= 3; ++i) {
  Console.Write("Thunder... ");
}
Console.Write("ThunderCats! HO!\n");

Well, OK - in that example they both end up being four lines long, but what if we had to run our operation 100 times? Or 100,000 times? Note that ++ is C# shorthand for "+= 1" - it just adds one to the variable.
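
To show how the loop pays off once the count grows, here's a small sketch where the number of repeats is a parameter. The Chant class and its Build() method are invented for this example.

```csharp
using System;

class Chant
{
    // Hypothetical helper: builds the chant with 'count' repeats of "Thunder... ".
    public static string Build(int count)
    {
        string chant = "";
        for (int i = 1; i <= count; ++i) {
            chant += "Thunder... ";
        }
        chant += "ThunderCats! HO!";
        return chant;
    }

    public static void Main(string[] args)
    {
        // prints: Thunder... Thunder... Thunder... ThunderCats! HO!
        Console.WriteLine(Build(3));
    }
}
```

Changing the 3 to 100 gets you a hundred Thunders without typing a single extra line.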

C# has different types of loops, of which "for" is just one. We're now going to extend our program so that it reads all the files in a directory, and, if the files have the ".txt" extension, print out the contents. This requires loops and conditional statements all rolled into one - Sword of Omens, give me sight beyond sight!

string[] files = Directory.GetFiles("/home/paul");

foreach(string file in files) {
  if (file.EndsWith(".txt")) {
    Console.Write(File.ReadAllText(file));
  }
}

That shows off five new things all in one, so let me break it down:

  • If we give Directory.GetFiles() a directory as its only parameter, it will return an array of strings (string[], remember?) containing all the filenames in that directory.
  • We can loop over most arrays using the foreach loop. This extracts each item in the array, and assigns it to a variable. In our example, we tell Mono to put each filename into the string "file".
  • All strings have the EndsWith() method, which returns true if the string the method is being used on ends with the string we pass in as its parameter. If it returns true, the code inside the braces (Console.Write(...)) is executed.
  • Rather than assign the return value of File.ReadAllText() to another string, we just use it directly as the parameter to Console.Write(). This is perfectly allowable, and helps make the code a little shorter to read.
  • Notice how C# distinguishes between "File" (which is a special class that lets us read and write files) and "file", which is a string variable we created in our code. All C# variables are case-sensitive like this.

Replace the existing contents of the Main() method with that new code, making sure to change /home/paul to a directory where there are some .txt files around.

If you hit F5 you'll see it works fine, but it's more like ThunderKittens code than ThunderCats code. Have you ever heard the phrase "fast, good, cheap - pick any two"? Well, Linux is all about open source, so "cheap" is basically a requirement. But it turns out that Mono lets us do "fast" and "good" as well: we can rejig our code to make it do more and work faster at the same time!

The magic lies in the Directory.GetFiles() method. Right now we're giving it just one parameter, which is the directory we want to search. But in C# methods can do different things depending on how many parameters we give them. We can make our code /faster/ by specifying a second parameter to GetFiles(), which lets us specify a search filter for our filenames. So, rather than having to use file.EndsWith(".txt"), we can just provide a second parameter to Directory.GetFiles that is "*.txt". That way, our files string array will /only/ contain files that end with .txt.

We can make our code do /more/ by specifying yet another parameter to Directory.GetFiles: SearchOption.AllDirectories. This tells Mono that we want to search for files not only in the current directory, but in all subdirectories of the current directory too.

So, your new super-fast, super-featureful Main() method ought to look like this:

string[] files = Directory.GetFiles("/home/paul", "*.txt", SearchOption.AllDirectories);

foreach (string file in files) {
  Console.Write(File.ReadAllText(file));
}
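
If you'd like to compare the three forms of Directory.GetFiles() without touching your home directory, the sketch below builds a small scratch directory tree in the system temp directory and counts what each call finds. The directory name, the filenames and the Find() helper are all made up for the illustration.

```csharp
using System;
using System.IO;

class FindText
{
    // Hypothetical helper: every .txt file under root, subdirectories included.
    public static string[] Find(string root)
    {
        return Directory.GetFiles(root, "*.txt", SearchOption.AllDirectories);
    }

    public static void Main(string[] args)
    {
        // Build a scratch tree: one .txt file at the top, plus one .txt
        // file and one .log file in a subdirectory.
        string root = Path.Combine(Path.GetTempPath(), "snarf-demo");
        string sub = Path.Combine(root, "lair");
        Directory.CreateDirectory(sub);   // creates root as well
        File.WriteAllText(Path.Combine(root, "lion-o.txt"), "Ho!");
        File.WriteAllText(Path.Combine(sub, "tygra.txt"), "Ho!");
        File.WriteAllText(Path.Combine(sub, "mumm-ra.log"), "Curses!");

        // One parameter: every file, top directory only.
        Console.WriteLine(Directory.GetFiles(root).Length);
        // Two parameters: only *.txt, top directory only.
        Console.WriteLine(Directory.GetFiles(root, "*.txt").Length);
        // Three parameters: only *.txt, whole tree.
        Console.WriteLine(Find(root).Length);

        // Tidy up: 'true' means delete the subdirectories and files too.
        Directory.Delete(root, true);
    }
}
```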


The big finale

I'm running out of space here, so it's time to ask the ancient spirits of C# to transform this decayed code into something that actually works! We've already seen how we can snag all the filenames in a given directory (and its subdirectories), and we've already seen how we can read and write files. Now what we need to do is make our program do two different things:

  • If no parameters are given, we need to scan the filesystem and save the file list into a file of its own. This is our file cache.
  • If a parameter is given, we're going to use this as the search parameter for our file list, and print out the contents of all the matching files.

What we're actually implementing is something very similar to the Linux "updatedb" command that's used to generate locate's file search cache. Of course, updatedb doesn't also print out the contents of the files it finds, so perhaps we can say our version is a bit better!

You already know most of the code needed to make this work, so let's just dive in. Replace your existing Main() code with this:

if (args.Length == 0) {
  string[] files = Directory.GetFiles("/home/paul", "*.txt", SearchOption.AllDirectories);
  File.WriteAllLines("filecache.snarf", files);
} else {
  string[] cache = File.ReadAllLines("filecache.snarf");

  foreach(string file in cache) {
    if (file.Contains(args[0])) {
      Console.Write(File.ReadAllText(file));
    }
  }
}

That code kicks off with something entirely new: args.Length. Last issue we saw that C# passes into our Main() method a string array called "args". To find out how big that array is (ie, how many parameters were passed in), we need to use args.Length. If that value is 0, no parameters were passed in, which means we need to generate the filename cache.

The WriteAllLines() method is very similar to the WriteAllText() method, with the exception that it takes a string array rather than a string as its second parameter. This is very helpful to us, because Directory.GetFiles() /returns/ a string array, so we can just pass that into WriteAllLines() to have it write each string in that array onto a line of its own in the file.

Again we have something new: else. You've already seen "if", which looks at a condition and evaluates some code if that condition is true. But what if the condition isn't true? That's where "else" comes in. For example, this code will print out "You're a ThunderCat!":

string Home = "Third Earth";
if (Home == "Middle Earth") {
  Console.WriteLine("You're a hobbit!");
} else {
  Console.WriteLine("You're a ThunderCat!");
}

In our Snarf code, the "else" statement is used to mean "if the number of parameters passed in isn't 0", meaning "run /this/ code if there actually were some parameters passed in." The code then goes ahead and calls ReadAllLines(), which reads each line of a text file into a string array.
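
WriteAllLines() and ReadAllLines() are mirror images of each other, which is what makes the cache work. Here's a minimal round-trip sketch: the RoundTrip() helper, the cache filename and the two paths in the array are invented for the example, and the files they name don't need to exist.

```csharp
using System;
using System.IO;

class CacheDemo
{
    // Hypothetical helper: write an array out, then read it straight back.
    public static string[] RoundTrip(string path, string[] lines)
    {
        File.WriteAllLines(path, lines);   // one array element per line
        return File.ReadAllLines(path);    // one line per array element
    }

    public static void Main(string[] args)
    {
        // Made-up filenames standing in for what GetFiles() would return.
        string[] names = { "/home/paul/lion-o.txt", "/home/paul/tygra.txt" };
        string cache = Path.Combine(Path.GetTempPath(), "filecache-demo.snarf");

        string[] back = RoundTrip(cache, names);
        foreach (string name in back) {
            Console.WriteLine(name);
        }
    }
}
```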

Finally we have our main block of code: we loop over every file in the cache to see whether it matches the parameter passed in. This matching is done with the special Contains() method that every string has: pass it in a parameter as a string, and it will return true if it finds it. The args[0] variable, as you saw last issue, contains the first parameter passed in on the command-line. If the filename matches the parameter, we read its text in and print it out, all in one line.
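
The matching step can be tried on its own without any real files. In this sketch the cache is just a hand-written array, and the Matcher class and CountMatches() helper are my own inventions. One thing to be aware of is that Contains() is case-sensitive, so "Tygra" and "tygra" are different as far as it's concerned.

```csharp
using System;

class Matcher
{
    // Hypothetical helper: how many cache entries contain the search term?
    public static int CountMatches(string[] cache, string term)
    {
        int found = 0;
        foreach (string file in cache) {
            if (file.Contains(term)) {
                ++found;
            }
        }
        return found;
    }

    public static void Main(string[] args)
    {
        // Made-up cache contents for the illustration.
        string[] cache = {
            "/home/paul/cats/tygra.txt",
            "/home/paul/cats/panthro.txt",
            "/home/paul/mutants/monkian.txt"
        };

        Console.WriteLine(CountMatches(cache, "cats"));    // prints 2
        Console.WriteLine(CountMatches(cache, "Tygra"));   // prints 0
    }
}
```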

That's it: our project is complete. Snarf searches our filesystem, it prints out all the files that match our user's parameter, and Third Earth is safe from Mumm-Ra for another day - all thanks to Mono!

On your disc you'll find the complete code for Snarf, with some extra bits and pieces from me to push you a bit further. Look out for messages being printed when the file cache is being created, the "numfilesfound" variable that gets incremented each time we match a file so that we can print out a message when no files are found, and an extra bit in there to print out the filename of each matching file.


Top tips

  • Along with EndsWith() and Contains(), strings also have Replace() (replace one substring with another), ToUpper() (convert the string to uppercase) and Trim() (remove whitespace from the beginning and end).
  • If you want to handle multiple parameters nicely, quit referring to args[0]. Instead, try using a foreach loop to extract each string parameter, and work on it individually.
  • Our code generates filecache.snarf when run with no parameters, but what if a user runs it with a parameter first? In this situation it will crash, as the cache hasn't been created yet. The solution is to use the File.Exists() method: if the cache exists, run the search. If it doesn't exist, create it first then run the search.
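
Here's one way those three tips might look in code. Treat it as a rough sketch rather than the version on the disc: the SnarfTips class and the Tidy(), MatchesAny() and LoadCache() helpers are all hypothetical names, and it searches the current directory rather than /home/paul so that it runs anywhere.

```csharp
using System;
using System.IO;

class SnarfTips
{
    // Tip 1: chain a few string methods together.
    public static string Tidy(string raw)
    {
        return raw.Trim().Replace(" ", "-").ToUpper();
    }

    // Tip 2: true if the filename contains any one of the search terms.
    public static bool MatchesAny(string file, string[] terms)
    {
        foreach (string term in terms) {
            if (file.Contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Tip 3: create the cache only if it's missing, then load it.
    public static string[] LoadCache(string cachePath, string root)
    {
        if (!File.Exists(cachePath)) {
            string[] files = Directory.GetFiles(root, "*.txt", SearchOption.AllDirectories);
            File.WriteAllLines(cachePath, files);
        }
        return File.ReadAllLines(cachePath);
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(Tidy("  sword of omens "));   // prints SWORD-OF-OMENS

        // "." is the current directory (an assumption made for this sketch).
        string[] cache = LoadCache("filecache.snarf", ".");
        foreach (string file in cache) {
            if (MatchesAny(file, args)) {
                Console.Write(File.ReadAllText(file));
            }
        }
    }
}
```

With no parameters, MatchesAny() never finds anything, so this version just makes sure the cache exists and exits quietly.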