Build an RSS reader
From LXF Wiki
Do you get the jitters when you haven't checked Slashdot for five minutes? Paul Hudson encourages you to read all about it with Mono...
The internet has a very low signal-to-noise ratio. To put it bluntly, the internet is full of crap - blogs about people getting to grips with their teenage angst, countless arguments about whether the Polish city is called Danzig or Gdansk and millions of hours of YouTube videos of kittens. But there are a few remaining sites of worth around, if you can find them. The problem is: how can you wade through all the rubbish to get at the good stuff? Here's where RSS steps in: it lets people subscribe to sites that interest them, then receive updates as the sites change. For example, if you like reading the BBC News site, but hate having to browse there every 30 minutes to see what's changed, RSS is for you.
Last issue we looked at working with files, and this issue we're going to take that a bit further by working with XML files. XML is very similar to HTML, in that it marks blocks of text as being something specific. But unlike HTML, XML doesn't use this markup to denote how things should look. Instead, XML says what things mean, which makes it perfect for sending data around. XML comes in many different types, and we're interested in RSS: Really Simple Syndication. This stores news items that get updated whenever the main site gets updated, which means people can tell their computer to download the RSS feed once every ten minutes, then read the headlines without needing to touch a web browser.
Mono comes with lots of tools for working with XML, so we don't have to worry about how to read the files. That means we can focus on what we intend to do with it, so let me tell you. We're going to:
- Build a program that can download RSS feeds and print them neatly.
- Make the program track the feed so that it only shows things that have appeared since we last checked.
- Have the program remember the feeds that users were interested in.
Just like last issue, we're going to build a real program that is actually useful - let's go!
Hunting the RSS
Before ye go hunting the RSS, the first thing you need to do is understand exactly how the RSS beastie looks. It may look complicated on the surface, but we can break it down into two parts: the channel description block and the news items themselves. Below is a prize specimen RSS feed, and you'll see that the channel (the news feed) has a title, description and link. This is all meta-information describing him - you can safely ignore this if you just want the news. You'll also see that there are two <item> elements, but there could easily be hundreds depending on what kind of RSS it is you catch. These are the actual news items, and again contain title, description and link fields, but this time these are specific to each individual news story. Here he is:
<?xml version="1.0" ?>
<rss version="2.0">
<channel>
<title>My Excellent Site</title>
<description>There's lots of great content here - please subscribe!</description>
<link>http://www.example.com</link>
<item>
<title>Mono rocks!</title>
<description>Free .NET takes over world</description>
<link>http://www.example.com/news/mono</link>
<guid>http://www.example.com/news/mono</guid>
</item>
<item>
<title>Mono beats PHP</title>
<description>Consistent function naming wins the day</description>
<link>http://www.example.com/news/monovsphp</link>
<guid>http://www.example.com/news/monovsphp</guid>
</item>
</channel>
</rss>
You'll notice that each <item> has identical <link> and <guid> elements. 'GUID' is short for globally unique identifier, and means any value that is unique to that exact story across the whole internet. This is required for RSS feeds, as it's used to let RSS programs know if they've seen that news story before or not. (Strictly speaking, the RSS 2.0 specification lists <guid> as optional, but a reader like ours can't track stories without it.) You need to be careful to choose GUIDs that are unique both a) within your own site, and b) across every other site. The easiest (and most common) way to do this is just to use the link to the story as the GUID, because no other story can share that URL.
So that's the blueprint of RSS. Now let's try it with a real example: the BBC news homepage. Following the instructions from the last two issues, start a new console project in MonoDevelop and call it whatever you please. In the "using" lines at the top, add this:
using System.Xml;
You also need to right-click on References in the Solution pane on the left of MonoDevelop's window, then select Edit References from the menu that appears. Make sure System.Xml is selected from that list, and click OK. Now change the Main() code to this:
XmlDocument doc = new XmlDocument();
doc.Load("http://tinyurl.com/8mwkm");
Console.Write(doc.InnerXml);
The TinyURL link is there to save space - you can use the full URL if you want to: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml.
That code uses the XmlDocument class to read a URL then print it to the screen. We're not doing anything fancy with it, we're just printing it out to the console. Hit F5 to compile and run the program, and, after a moment's delay while the RSS file is downloaded, you should see a large chunk of text printed out in the Application Output pane in MonoDevelop. This is our RSS - it may look like a complex wee monster, but we're going to tame him!
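If you'd like to poke at the specimen feed without going online, something along these lines should do it. This is only a sketch: the class name FeedPeek, the Describe() method and the cut-down SampleFeed string are my own inventions for illustration. It relies on XmlDocument.LoadXml(), which parses XML held in a string, where Load() fetches it from a file or URL.

```csharp
using System;
using System.Xml;

class FeedPeek {
    // A cut-down copy of the specimen feed from earlier (hypothetical data).
    public const string SampleFeed =
        "<?xml version=\"1.0\" ?>" +
        "<rss version=\"2.0\"><channel>" +
        "<title>My Excellent Site</title>" +
        "<link>http://www.example.com</link>" +
        "<item><title>Mono rocks!</title>" +
        "<guid>http://www.example.com/news/mono</guid></item>" +
        "</channel></rss>";

    // Returns the root element's name and its version attribute, e.g. "rss 2.0".
    public static string Describe(string xml) {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml); // parse the string directly; no download involved
        XmlElement root = doc.DocumentElement;
        return root.Name + " " + root.GetAttribute("version");
    }

    static void Main() {
        Console.WriteLine(Describe(SampleFeed)); // prints "rss 2.0"
    }
}
```

Working from a string like this is handy while you experiment, because you aren't waiting on the BBC's servers every time you hit F5.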
Just the headlines
How did the Egyptians manage to build something as amazing as the pyramids using only ancient technology and a few thousand Israelites? Easy: whips! And we can whip our RSS into shape equally easily using some rather excellent .NET methods: SelectSingleNode() and SelectNodes(). These let you search through XML for the exact data you're interested in, and return either just one XML node (the name for an XML element such as <item> once it has been read by our program) or all the matching nodes.
So, what we want v2 of our program to do is to read all the news items, then print the headline and description information from each story. Here's my recipe for Ye Olde Hudson RSS Reader v2:
1. Preheat your RSS by passing it through XmlDocument.Load().
2. Peel away the skin to reveal only the <item> elements we care about.
3. Gently sift through the <item> elements, sprinkling their data over Console.Write() as necessary.
4. Season with salt, and serve.
Or in the more conventional C#...
XmlDocument doc = new XmlDocument();
doc.Load("http://tinyurl.com/8mwkm");
XmlNodeList items = doc.SelectNodes("//item");
foreach (XmlNode item in items) {
    Console.WriteLine(item.SelectSingleNode("title").InnerText);
    Console.WriteLine(" " + item.SelectSingleNode("description").InnerText);
    Console.WriteLine("");
}
The parameter that gets passed into SelectNodes - //item - is known as XPath. This is the special way of searching for things inside XML, and our example means 'get any <item> element, anywhere in the XML'. That's what the // means: 'get any'. Take a look at this XML:
<stuff>
<clothing>
<item>Trousers</item>
<item>Socks</item>
</clothing>
<news>
<item>Wii released</item>
<item>Xbox 360 sucks!</item>
</news>
</stuff>
If we use the XPath //item to get news items from that XML, we'll be disappointed: it will pull out items of clothing and items of news in the same search! Rather than using the 'get any' search //, you would need to be more specific and say you only want <item> elements that are part of <news> elements. In XPath, you would use /stuff/news/item, which spells out the full path from the root <stuff> element (or //news/item, if you don't care where the <news> element lives).
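To see the difference for yourself, here's a small sketch that counts how many nodes various XPath searches find in that XML. The class name XPathDemo and its Count() helper are made up for this example; the only library calls are XmlDocument.LoadXml() and SelectNodes().

```csharp
using System;
using System.Xml;

class XPathDemo {
    // The clothing-and-news XML from above, squeezed onto a few lines.
    public const string Stuff =
        "<stuff>" +
        "<clothing><item>Trousers</item><item>Socks</item></clothing>" +
        "<news><item>Wii released</item><item>Xbox 360 sucks!</item></news>" +
        "</stuff>";

    // Returns how many nodes the given XPath search matches.
    public static int Count(string xpath) {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(Stuff);
        return doc.SelectNodes(xpath).Count;
    }

    static void Main() {
        Console.WriteLine(Count("//item"));           // 4: clothing and news together
        Console.WriteLine(Count("/stuff/news/item")); // 2: news only, full path from the root
        Console.WriteLine(Count("//news/item"));      // 2: news only, wherever <news> lives
        Console.WriteLine(Count("/news/item"));       // 0: the root element is <stuff>, not <news>
    }
}
```

Note the last line: a path that starts with a single / has to begin at the root element, so leaving out <stuff> finds nothing at all.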
Our RSS feed only uses <item> when it's referring to news items, so using //item is safe enough for now. This search gives us back a variable that's known as an XmlNodeList. If an XML node contains one XML element, then an XmlNodeList contains several XML elements, right? Right. I just wanted to make sure you hadn't lost the plot while we were discussing XPath!
Once we have a list of all the news items, it's just a matter of printing them out. Last issue I introduced the foreach loop, and now it's back - and working with XmlNodes rather than plain old strings. This loop goes through each news item that was returned from SelectNodes(), and puts it into the "item" variable ready for us to read. Each <item> in our XML contains several interesting children of its own: the title of the news, the description, the link, and so on. To extract each of these, we need to use the SelectSingleNode() method on our item, which gives us an XmlNode. So to get the title of a news item, we need to use item.SelectSingleNode("title"). But that just gives us an XML node, which is just a .NET representation of the <title> XML element as opposed to the actual contents of the XML node. That's what the InnerText part does: it retrieves the actual textual content from an XmlNode object.
So, with all that in mind, here's one of those code lines again:
Console.WriteLine(item.SelectSingleNode("title").InnerText);
That works out as:
- Using the current item...
- Get its title node...
- Then get the text of that title node...
- And print it out to the console.
After the headline and description are printed out for a news story, Console.WriteLine() is called with an empty string so that it prints a blank line between stories.
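One thing to watch: SelectSingleNode() returns null when nothing matches, so reading InnerText from a story that lacks a <description> would crash the program. A possible safeguard, sketched below, is a little helper that substitutes some fallback text. The names SafeText and ChildText() and the fallback strings are my own choices here, not part of .NET.

```csharp
using System;
using System.Xml;

class SafeText {
    // Returns the text of the named child element, or 'fallback' if it is missing.
    public static string ChildText(XmlNode parent, string name, string fallback) {
        XmlNode child = parent.SelectSingleNode(name);
        return child == null ? fallback : child.InnerText;
    }

    static void Main() {
        XmlDocument doc = new XmlDocument();
        // A hypothetical news item with a title but no description.
        doc.LoadXml("<item><title>Mono rocks!</title></item>");
        XmlNode item = doc.DocumentElement;
        Console.WriteLine(ChildText(item, "title", "(no title)"));
        Console.WriteLine(" " + ChildText(item, "description", "(no description)"));
    }
}
```

In the reader itself you could swap each item.SelectSingleNode("...").InnerText for a call like this, if the feeds you follow turn out to be patchy.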
That's it: compile and run your program with F5, and be amazed at how your wonderful culinary skills have transformed the raw ingredients of an RSS feed into a readable printout on your screen!
What's new, pussycat?
Our program has a problem: RSS feeds can be long, and really people only care about what's changed since they last checked the feed. This is a real problem: how can we track which RSS news items people have read already, and only show the ones they haven't seen? Well, cast your mind back a thousand words or so, and you'll remember globally unique identifiers. Here's what I said: "This is required for RSS feeds, as it's used to let RSS programs know if they've seen that news story before or not." Each RSS news item needs a GUID so that it's absolutely unique on the web, and we can use that to know whether we've seen something before or not.
Here's how it should work:
- Get the RSS feed
- Store all the GUIDs, one per line, in a file
- Next time the RSS feed is loaded, only show news items if they don't appear in our list of cached GUIDs.
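It's only three measly steps, but actually programming the thing is a bit harder. The File class lives in the System.IO namespace, so make sure using System.IO; is among your "using" lines. Here's how the new Main() method ought to look - I've added comments throughout to explain how it all works: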
XmlDocument doc = new XmlDocument();
doc.Load("http://tinyurl.com/8mwkm");
// this string array will store the contents of our cache file
string[] guidcache;
if (File.Exists("guidcache.txt")) {
    // we have a cache file - go ahead and read it in!
    guidcache = File.ReadAllLines("guidcache.txt");
} else {
    // we don't have a cache file - create a new string array with 0 elements (ie, it's empty)
    guidcache = new string[0];
}
// grab all the news items as per usual...
XmlNodeList items = doc.SelectNodes("//item");
foreach (XmlNode item in items) {
    // presume by default we're going to show the user this news item
    bool showthisitem = true;
    // now go through each GUID in our cache...
    foreach (string guid in guidcache) {
        // ... and compare it against the GUID of this news item
        if (guid == item.SelectSingleNode("guid").InnerText) {
            // if we're here, we've got a match - don't show this item!
            showthisitem = false;
            // this tells C# to exit the loop - we've matched the GUID, and so don't need to check against other GUIDs in the cache
            break;
        }
    }
    if (showthisitem) {
        // we can only get here if the GUID isn't in our cache - print it out!
        Console.WriteLine(item.SelectSingleNode("title").InnerText);
        Console.WriteLine(" " + item.SelectSingleNode("description").InnerText);
        Console.WriteLine("");
        // ... now add the GUID to our cache file for next time.
        File.AppendAllText("guidcache.txt", item.SelectSingleNode("guid").InnerText + "\n");
    }
}
That's the easiest way to write the code, but if you're looking for something that runs a bit faster, I suggest you insert this just after the bool showthisitem line:
string thisguid = item.SelectSingleNode("guid").InnerText;
So, rather than having to call SelectSingleNode() for every GUID in the cache and for every news item, that line caches the GUID in a string variable that you can use instead of the other SelectSingleNode("guid") calls.
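If your cache grows to thousands of GUIDs, the inner foreach loop itself becomes the slow part. One alternative you might try, and it is a swap of technique rather than what the code above does, is to keep the cached GUIDs in a HashSet<string>, which can answer "have I seen this?" without walking the whole list. This sketch assumes a Mono recent enough to provide HashSet (it lives in System.Collections.Generic); the GuidCache class and FilterNew() method are names I've made up.

```csharp
using System;
using System.Collections.Generic;

class GuidCache {
    // Returns the GUIDs from 'incoming' that are not yet in 'seen',
    // and records them in 'seen' so they are skipped next time.
    public static List<string> FilterNew(IEnumerable<string> incoming, HashSet<string> seen) {
        List<string> fresh = new List<string>();
        foreach (string guid in incoming) {
            // Add() returns false when the value was already in the set
            if (seen.Add(guid)) {
                fresh.Add(guid);
            }
        }
        return fresh;
    }

    static void Main() {
        // Pretend this came from File.ReadAllLines("guidcache.txt")
        HashSet<string> seen = new HashSet<string>(
            new string[] { "http://www.example.com/news/mono" });
        string[] feed = {
            "http://www.example.com/news/mono",
            "http://www.example.com/news/monovsphp"
        };
        foreach (string guid in FilterNew(feed, seen)) {
            Console.WriteLine(guid); // prints only the monovsphp link
        }
    }
}
```

For a feed with a few dozen stories you won't notice the difference, so treat this as an optional extra.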
Subscribe today!
Let's take our program to warp speed: right now we have the BBC URL right in our source code, which is generally referred to as being hardcoded. But what if people want to read another news source? Or what if they want to read several news sources and update them all simultaneously? This requires some more advanced coding, but it does start to make our program useful at last.
As I see it, our program has to be able to do the following:
- When provided with the parameter sub followed by a URL, it should subscribe to that feed.
- When provided with the parameter unsub followed by a URL, it should unsubscribe from that feed.
- When provided with no parameters at all, it should refresh all the RSS feeds and show all the new entries.
- When provided with the parameter reset it should clear the GUID list and refresh the feeds, showing all entries in all the feeds.
That's nothing too far above our current code, but there is one subtle change here: actions 3 and 4 both need to print out the RSS feeds. Now, the coarse way to solve this is to select all the RSS printing code we have already, then copy and paste it so the same code is in our program twice. This works, but it sucks. It sucks because it increases the size of our program, and it sucks because if we fix a bug in one place we have to remember to fix it in the other place too. A much better solution is to create our own method that can be called from anywhere, and centralises all the code in one place.
But first, we need to write the code to subscribe to and unsubscribe from our feeds. This is where you can learn about another cool new thing: the switch/case block. You've already met the conditional statement (think if/else!), but that gets rather hard to read if you're checking multiple things. The switch/case block lets us check a variable against multiple different values without making our code messy. For example, our basic code to check what operation the program should perform would look like this:
switch (args.Length) {
    case 0:
        // refresh the feeds!
        break;
    case 1:
        // reset the feeds!
        break;
    case 2:
        // sub or unsub to a feed!
        break;
}
What that does is check to see whether args.Length is equal to 0, 1 or 2. The args.Length value gets automatically set to the number of parameters that were passed into our program from the command line, so this is really saying "how was this program run?" If no parameters are provided, we need the program to refresh all the feeds. If one parameter is provided, we can just go ahead and reset the feeds - we don't need to check what that parameter is, because the only reason our program would be called with just one parameter is to reset the feeds. Finally, if two parameters are provided we need to check whether it's a sub or an unsub, then take the appropriate action.
When the switch finds a matching case, it goes ahead and executes the code from there until it reaches a break instruction. So the "break" line is really just there to say "I'm finished with this match, pick up at the end of this switch/case block". In C and C++, the languages C# borrows its syntax from, if you didn't have a break statement in there, the program would just carry on executing the next case regardless of whether it matched or not. C# doesn't allow this: the compiler rejects any case that contains code and then runs on into the next one, so you need to use break anyway.
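There is one situation where C# does let cases share code: labels with nothing between them can be stacked on top of each other. Here's a small sketch showing that, along with the default label that catches anything unmatched. The class SwitchDemo, the Action() method and the "subscribe" alias are purely illustrative; our program only understands sub and unsub.

```csharp
using System;

class SwitchDemo {
    // Maps a command word to a description of what the program would do.
    public static string Action(string command) {
        string result;
        switch (command) {
            case "sub":
            case "subscribe": // empty labels may be stacked: both run the code below
                result = "add feed";
                break;
            case "unsub":
                result = "remove feed";
                break;
            default: // runs when no other case matches
                result = "unknown command";
                break;
        }
        return result;
    }

    static void Main() {
        Console.WriteLine(Action("sub"));       // add feed
        Console.WriteLine(Action("subscribe")); // add feed
        Console.WriteLine(Action("unsub"));     // remove feed
        Console.WriteLine(Action("dance"));     // unknown command
    }
}
```

Try deleting one of the break lines and recompiling: the compiler should complain rather than quietly running the next case.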
So, we're going to deal with the subscribing and unsubscribing first. This needs to check whether sub or unsub was provided, then either add the feed to the subscription list or remove it. Here goes:
case 2:
    // sub or unsub!
    if (args[1] == "") return;
    if (args[0] == "sub") {
        // add the site to the existing list
        File.AppendAllText("sitelist.txt", args[1] + "\n");
    } else if (args[0] == "unsub") {
        // only unsubscribe when asked to - any other word is ignored
        if (File.Exists("sitelist.txt")) {
            // remove site from the list
            string[] sitelist = File.ReadAllLines("sitelist.txt");
            File.Delete("sitelist.txt");
            foreach (string site in sitelist) {
                if (site == args[1]) {
                    // aha! this is the site we need to drop; ignore it
                } else {
                    File.AppendAllText("sitelist.txt", site + "\n");
                }
            }
        }
    }
    break;
Subscribing is pretty simple, but unsubscribing is a little trickier. In the code above, it works by reading in the sites file then deleting it. It then goes over every site that the user currently subscribes to and writes it out line-by-line to the sites file. But when it finds the site that they want to unsubscribe from, it skips over it.
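If deleting the file and rebuilding it one line at a time feels a bit nerve-racking, another way to get the same result is to build the new list in memory and write it back in one go with File.WriteAllLines(). The sketch below is an alternative, not the code used in this tutorial; the Unsub class and Without() method are hypothetical names, and for brevity this Main() takes just the URL to drop as its only parameter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class Unsub {
    // Returns a copy of 'sites' with every entry equal to 'url' left out.
    public static string[] Without(string[] sites, string url) {
        List<string> kept = new List<string>();
        foreach (string site in sites) {
            if (site != url) {
                kept.Add(site);
            }
        }
        return kept.ToArray();
    }

    static void Main(string[] args) {
        // Nothing to do without a URL or without a subscription list
        if (args.Length != 1 || !File.Exists("sitelist.txt")) return;
        string[] sites = File.ReadAllLines("sitelist.txt");
        // WriteAllLines() replaces the file's contents in a single call
        File.WriteAllLines("sitelist.txt", Without(sites, args[0]));
    }
}
```

Keeping the filtering in its own method also makes it easy to check on its own, without touching any files.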
The other two cases are much easier, and ought to look like this:
switch (args.Length) {
    case 0:
        // refresh!
        ReadFeeds();
        break;
    case 1:
        // reset!
        File.Delete("guidcache.txt");
        ReadFeeds();
        break;
    // case 2 (sub or unsub) goes here, exactly as shown above
}
The ReadFeeds() method is what I meant about code re-use: we could paste all the code needed to read feeds directly into both case statements, but it's faster to just create our own method: ReadFeeds(). (Because Main() is static, ReadFeeds() has to be declared static too.) So, when the program is called without any parameters, ReadFeeds() is called immediately. When it's called with a single parameter, we clear the GUID cache and then call ReadFeeds().
The ReadFeeds() method itself is largely the same as the old RSS reading code, but we need to modify it so that it can read from multiple sites. To do this, we need to read in the site subscription list, then loop over each site as well as looping over individual news items in each site. Here's the important bit:
string[] sitelist;
if (File.Exists("sitelist.txt")) {
    sitelist = File.ReadAllLines("sitelist.txt");
} else {
    sitelist = new string[0];
}
foreach (string site in sitelist) {
    XmlDocument doc = new XmlDocument();
    doc.Load(site);
    // ... the GUID-checking and printing code from earlier goes here ...
}
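Once you're reading several feeds, one dead server or mangled file shouldn't stop you seeing the rest. A possible approach, sketched here, is to wrap the Load() call in try/catch and skip any feed that fails. The SafeLoad class and TryLoad() method are my own names, and the set of exceptions caught is an assumption: exactly what a network failure throws can vary between runtimes, so you may need to catch more.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

class SafeLoad {
    // Returns the loaded document, or null if the feed could not be read.
    public static XmlDocument TryLoad(string site) {
        XmlDocument doc = new XmlDocument();
        try {
            doc.Load(site);
            return doc;
        } catch (XmlException e) {
            // the download worked, but the contents aren't well-formed XML
            Console.Error.WriteLine("Bad XML in " + site + ": " + e.Message);
        } catch (IOException e) {
            // missing file or similar I/O trouble
            Console.Error.WriteLine("Couldn't read " + site + ": " + e.Message);
        } catch (WebException e) {
            // network trouble (assumption: some runtimes may throw other types)
            Console.Error.WriteLine("Couldn't fetch " + site + ": " + e.Message);
        }
        return null;
    }

    static void Main(string[] args) {
        // For this sketch, the feeds to try are given on the command line
        foreach (string site in args) {
            XmlDocument doc = TryLoad(site);
            if (doc == null) continue; // skip this feed, carry on with the rest
            Console.WriteLine(site + ": " + doc.SelectNodes("//item").Count + " items");
        }
    }
}
```

In ReadFeeds() the same idea would mean calling something like TryLoad(site) at the top of the loop and using continue when it hands back null.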
It's nothing new, but it does finish off our program perfectly. Hit F8 to compile the program without running it, then open up a terminal and browse to the location of your MonoDevelop project. From there, look for the bin/Debug directory, and you should find an executable waiting for you. Give it a try - I think you'll agree it's actually very useful! And we've come a long way: you should no longer be afraid of XML, you've learnt why creating your own methods is a good thing, and you've also learnt how to use the switch/case block for cleaning up complex conditional statements. More importantly, you've built your second working and useful project using C# and Mono - well done!
Top tip
We use the reset parameter to clear the RSS GUIDs for all our feeds so that our program will download all the news from all the feeds. If you want to let people provide a URL parameter to reset only that feed, the easy way to do it is to store GUIDs in a per-site file - rather than guidcache.txt, you could have guid-news.bbc.co.uk.txt. To clear the GUID cache for one site, just delete that site's file.
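As a rough sketch of how those per-site filenames might be generated, you could lean on the Uri class, whose Host property gives you just the server name from a URL. The CacheNames class and CacheFileFor() method are hypothetical names of my own.

```csharp
using System;

class CacheNames {
    // Builds a cache filename such as "guid-news.bbc.co.uk.txt" from a feed URL.
    public static string CacheFileFor(string feedUrl) {
        Uri uri = new Uri(feedUrl); // throws UriFormatException if the URL is malformed
        return "guid-" + uri.Host + ".txt";
    }

    static void Main() {
        Console.WriteLine(CacheFileFor(
            "http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml"));
        // prints "guid-newsrss.bbc.co.uk.txt"
    }
}
```

Bear in mind that two feeds hosted on the same server would end up sharing one cache file under this scheme, so resetting one would reset both.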
Top tip
There's a File.Create() method for creating files, but we don't need it - File.AppendAllText() automatically creates the file if it doesn't already exist.
Top tip
It's important to check how many arguments are passed in before reading from the args array, because your program will crash with an IndexOutOfRangeException if you try to read an element that doesn't exist. Be careful!

