
Tim Dams is a teacher at the Artesis University College in Antwerp, Belgium where he is mainly involved in software engineering and programming courses. In his spare time he likes to write Silverlight, WPF and Windows Phone 7 applications and blog about the things he learns in the process. Tim is a DZone MVB and is not an employee of DZone and has posted 19 posts at DZone.

Writing a Windows Phone Website Scraper Application

04.21.2012

In this tutorial I will explain how to write a WP7 application that uses the Html Agility Pack to consume information scraped from a website.

Website scraping is the act of retrieving information from a web page. Some consider it stealing, others borrowing. Let’s leave that debate to others. In this post I will show how easy it is to scrape content from a website so that you can (re)use it in your Windows Phone 7 application. Most of this information will of course also apply to other, non-WP7 projects.

Sometimes website scraping is the only way to consume certain information from a website. If the website doesn’t offer a publicly available API or web service, you’re pretty much left with scraping, whether you like it or not.

Now, before reading on, it is extremely important to understand that there are legal issues concerning scraping: basically, as far as I understand it, you’re only allowed to use scraped data if you have permission to do so from the website owner (i.e. the one that ‘owns’ the data).

Html Agility Pack

To get started, we first need the Html Agility Pack (HAP), which is a very thorough HTML parser (read all about it here). The nice thing about HAP is that it also supports XPath queries and LINQ to Objects, making it actually fun (in a geeky way) to perform web scraping in C#. Unfortunately, ‘HAP for WP7’ currently doesn’t support XPath queries, so we’ll use LINQ to Objects for the remainder of this tutorial.

For some nice demos of what can be done with HAP, make sure to check out the following tutorial.

When you download HAP from CodePlex, you will still need to build the WP7 dll manually. The following post describes the steps needed to do this. For the lazy people amongst us, here you will find the compiled dll, based on the HAP version of February 2012 (make sure to ‘unblock’ the dll if you downloaded it from this site: right-click the file, choose Properties and then click ‘Unblock’).

Add reference to HAP in WP7

Next we need to add a reference to the HAP dll in order to use all the sweetness it contains. In Solution Explorer, right-click the References folder and choose Add Reference.

[Image: screenshot of the Add Reference step]

Next, point to the downloaded or compiled HtmlAgilityPack.dll file and choose Add.

Finally add a using statement and we are all set to go:

[Image: code screenshot of the using statement]
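The code for this step was published as an image and is not reproduced in this text. As a stand-in, here is a minimal sketch, assuming the library's usual root namespace HtmlAgilityPack. The System.Linq directive is an addition that the LINQ to Objects queries later in this post would need.

```csharp
// Namespace that holds the HAP types (HtmlDocument, HtmlNode, HtmlWeb).
using HtmlAgilityPack;
// Assumption: added for the LINQ to Objects operators (Where, FirstOrDefault) used later.
using System.Linq;
```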

Download page to scrape

We use the LoadAsync method from the static HAP class HtmlWeb to download and parse our html file. We provide the url to download, as well as the callback method to invoke once the file has been downloaded and processed by HAP:

[Image: code screenshot of the LoadAsync call]

Next we define the callback method in which we immediately check if the download went well. Once that is done, we can start querying our parsed html document:

[Image: code screenshot of the callback method]
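Since the two code images above did not survive, the following is a rough sketch of how the download call and its callback could fit together. It assumes that the WP7 build of HAP exposes a static LoadAsync(string, EventHandler<HtmlDocumentLoadCompleted>) overload and that the event arguments carry Error and Document properties; check this against the dll you built. The class name ComicDownloader and the Page property are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public class ComicDownloader
{
    // Hypothetical holder for the parsed page, so later code can query it.
    public HtmlDocument Page { get; private set; }

    public void Start()
    {
        // Assumption: static LoadAsync(url, callback) as found in the WP7 build of HAP.
        HtmlWeb.LoadAsync("http://xkcd.com", OnPageLoaded);
    }

    private void OnPageLoaded(object sender, HtmlDocumentLoadCompleted e)
    {
        // Assumption: Error is non-null when the download or parsing failed.
        if (e.Error != null || e.Document == null)
        {
            return; // nothing to query
        }

        // The document is parsed and ready for LINQ queries.
        Page = e.Document;
    }
}
```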

Alternatives to downloading pages directly

As a side note, you can also use the WP7 WebClient.DownloadStringAsync method to download the html. Afterwards you can feed the downloaded string (or a string you retrieved in some other, more obscure manner) to HAP using the LoadHtml() method:

[Image: code screenshot of the LoadHtml call]
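As an illustration of this alternative route, here is a minimal sketch that downloads the page with WebClient and hands the resulting string to LoadHtml(). The class and method names (StringDownloader, Parse) are made up for this example, and the error handling is deliberately basic.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

public class StringDownloader
{
    // Hypothetical holder for the parsed page.
    public HtmlDocument Page { get; private set; }

    public void Start()
    {
        var client = new WebClient();
        client.DownloadStringCompleted += OnDownloadCompleted;
        client.DownloadStringAsync(new Uri("http://xkcd.com"));
    }

    private void OnDownloadCompleted(object sender, DownloadStringCompletedEventArgs e)
    {
        // Reading e.Result after a failed download would throw, so check first.
        if (e.Error != null || e.Cancelled)
        {
            return;
        }
        Page = Parse(e.Result);
    }

    // Works for any html string, whatever its origin.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }
}
```

Keeping Parse separate from the download makes it easy to try the queries on a hardcoded html string before going to the network.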

Discovering the page layout

For the purpose of this demo, we would like to retrieve the url of the latest xkcd.com joke image. The best way to do this is to open your browser and open the developer tools by hitting the F12 key (works for me in Chrome and IE9). I prefer Chrome for this because the currently selected element is highlighted on the page itself, making it easier to rapidly drill down to the element you need.

To rapidly view the position of a specific element inside the html DOM, simply right click the element on the page and choose “Inspect element”. So we right click the image and choose “Inspect element”:

[Image: screenshot of the browser developer tools with the comic image selected]

Using the developer tools you now have to look for the element(s) and/or attributes needed. Once you have correctly identified the needed element, you need to find the most straightforward way to retrieve that element in your code:

  • if the html has abundant div and/or other elements with unique ids, it is simply a matter of finding that specific element by filtering on the unique id.
  • if the element has no unique identifier and is part of a group of equal elements, you will need to iterate over all these elements and do some manual comparisons using, for example, regular expressions; or hardcode the exact position (e.g. retrieve the third img element from a given node).


Note, at the bottom of the developer tools you’ll also notice the full ‘path’ of the element, which can be handy if you’re getting lost in more advanced (or crappy) pages.

To be honest, the ‘hardest part’ in my opinion is identifying the correct path to the wanted element.

Using Linq-to-objects to get the goodies

Once we have identified the correct way of retrieving the element (or a certain attribute value of that element), it is time to start writing the necessary LINQ code. The basic principles are explained very clearly in this blog, so I’ll immediately dive into some more ‘hardcore’ *ahem* html.

For this demo we need the value of the “src” attribute of the img-element, inside the uniquely named div-element with id “comic”.

First we’ll try to ‘capture’ the unique div. If we don’t find that one, we can be pretty certain that some error occurred (e.g. our url is wrong, the site has changed its html, etc.):

[Image: code screenshot of the first LINQ query]
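The original query was shown as an image, so what follows is only a sketch of how such a filter might look. It assumes Descendants("div") is used to walk the document and GetAttributeValue to read the id; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ComicQueries
{
    // Returns every div in the document whose id attribute equals "comic".
    public static IEnumerable<HtmlNode> FindComicDivs(HtmlDocument doc)
    {
        return from div in doc.DocumentNode.Descendants("div")
               // The second argument is the default used when the attribute is missing.
               where div.GetAttributeValue("id", "") == "comic"
               select div;
    }
}
```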

So here we capture all the child nodes of the document that are div-elements whose “id” attribute equals “comic”. Notice that a new IEnumerable collection is returned: we can feed this collection to a new query or iterate over this collection with, for example, a foreach loop.

Because we actually know that there will be only one element, or none at all, we will change our query to:

[Image: code screenshot of the query using FirstOrDefault()]

By using FirstOrDefault() we can then check if the correct element was found or not (it will be null if not found), without having to cope with possible exceptions:

[Image: code screenshot of the null check]
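A possible shape for the FirstOrDefault() variant and the null check that follows it is sketched below. The helper name TryFindComicDiv is invented for this example; the original code is not available in this text.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ScrapeStart
{
    // Returns true and sets comicDiv when the page contains a div with id "comic".
    public static bool TryFindComicDiv(HtmlDocument doc, out HtmlNode comicDiv)
    {
        // FirstOrDefault yields null instead of throwing when nothing matches.
        comicDiv = doc.DocumentNode
                      .Descendants("div")
                      .FirstOrDefault(d => d.GetAttributeValue("id", "") == "comic");

        if (comicDiv != null)
        {
            return true; // safe to drill down further
        }

        return false; // wrong url, changed page layout, ...
    }
}
```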

Once inside the if-braces, we can be fairly sure we’ll find the desired image url.

Because we know that inside the comic-div there will be only one element of the img-kind we can retrieve the value of its attributes using the following statement:

[Image: code screenshot of the statement that reads the src attribute]
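For readers without the image, the statement could look roughly like the sketch below, which combines Element("img") with GetAttributeValue. The wrapper class is hypothetical, and the null guard is an addition that the original statement may not have had.

```csharp
using HtmlAgilityPack;

public static class ComicImage
{
    // comicDiv is the div with id "comic" found earlier.
    public static string GetImageUrl(HtmlNode comicDiv)
    {
        // Element returns the first direct child named "img", or null if there is none.
        HtmlNode img = comicDiv.Element("img");
        if (img == null)
        {
            return null;
        }

        // Empty string is the default when the src attribute is missing.
        return img.GetAttributeValue("src", "");
    }
}
```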

The Element() method returns the single element of the “img” type. Next we retrieve the value of the “src” attribute of the img-element.

It should be noted that the slightest mistake in your queries can result in an exception being thrown; make it a practice to catch these (otherwise your WP7 application will simply quit without any notification should an HAP exception occur).
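One way to follow that advice is to wrap the whole query chain in a try/catch, as in this sketch. The method name TryGetImageUrl and the choice to return null on failure are assumptions made for the example, not part of the original code.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class SafeScraper
{
    // Returns the comic image url, or null when anything in the chain fails.
    public static string TryGetImageUrl(HtmlDocument doc)
    {
        try
        {
            return doc.DocumentNode
                      .Descendants("div")
                      .First(d => d.GetAttributeValue("id", "") == "comic")
                      .Element("img")
                      .GetAttributeValue("src", "");
        }
        catch (Exception)
        {
            // First() throws when no div matches; a missing img makes Element
            // return null, so the call that follows throws as well.
            return null;
        }
    }
}
```

Catching the base Exception type is coarse; in a real application you would probably log the error or show a message to the user.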

Using Element and Elements

Depending on your preference, there are several ways to retrieve the data you need. For example, if we are 100% certain that the page will have the html layout we need, we can drill down to the needed value using the Element/Elements methods, as follows.

  1. You use Element(string type) if you are certain that the current node will have one and only one child element of the given type, which is passed as a parameter.
  2. You use Elements(string type) to retrieve all the child nodes of the current node of a given type.


In order to retrieve our image url, we could open the developer tool and note the unique path to it on the bottom of the window:

[Image: screenshot of the element path shown at the bottom of the developer tools]

It’s now a matter of translating this path to the correct fluent method chain (is that a correct word?), resulting in the following:

[Image: code screenshot of the Element/Elements query with id filters]
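The exact path from the screenshot is not available in this text, so the sketch below uses an assumed layout: html, then body, then a div with the hypothetical id “middleContainer”, then the div with id “comic”, then the img. Substitute whatever path your developer tools show.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ElementPath
{
    public static string GetImageUrl(HtmlDocument doc)
    {
        HtmlNode body = doc.DocumentNode.Element("html").Element("body");

        // "middleContainer" is a placeholder id, not taken from the original post.
        HtmlNode container = body.Elements("div")
            .First(d => d.GetAttributeValue("id", "") == "middleContainer");

        HtmlNode comic = container.Elements("div")
            .First(d => d.GetAttributeValue("id", "") == "comic");

        return comic.Element("img").GetAttributeValue("src", "");
    }
}
```

Unlike Descendants, Element and Elements only look at direct children, so every level of the path has to be spelled out.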

In fact, we could even hardcode this more (not always recommended). We know that we need the second div element of the body, and inside that div we again need the second div-element. So we could write the following and thus skip the need for specifying the filters:

[Image: code screenshot of the position-based Element/Elements query]
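A position-based version could be sketched as follows, using ElementAt(1) to pick the second div at each level, as described above. This is an illustration only; the original statement was an image and may have indexed the elements differently.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class PositionalPath
{
    public static string GetImageUrl(HtmlDocument doc)
    {
        return doc.DocumentNode
                  .Element("html")
                  .Element("body")
                  .Elements("div").ElementAt(1) // second div of the body
                  .Elements("div").ElementAt(1) // second div inside that div
                  .Element("img")
                  .GetAttributeValue("src", "");
    }
}
```

ElementAt is zero-based and throws when the index is out of range, which is one more reason to wrap such chains in a try/catch.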

Note: I can’t really claim to be a skilled LINQ writer, so if any of these steps can be done more quickly, don’t hesitate to mention so.

Cherry on the pie: show the image

To show that this works, suppose we have the following, very empty, WP7 xaml page on which we wish to load the joke:

[Image: screenshot of the xaml page containing the Image control]

All that needs to be done is assign the retrieved url as a new source to the Image control, named jokeimg:

[Image: code screenshot of the assignment to the Image control]
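The assignment could look like the sketch below, which wraps the scraped url in a BitmapImage. It assumes the src value is an absolute url and that the code runs on the UI thread; the helper class is hypothetical, and in the original post the statement presumably lived directly in the page's code-behind.

```csharp
using System;
using System.Windows.Controls;
using System.Windows.Media.Imaging;

public static class JokeImageLoader
{
    // jokeimg is the Image control from the xaml page; imageUrl is the scraped src value.
    public static void Show(Image jokeimg, string imageUrl)
    {
        // Assumption: imageUrl is absolute; a relative src would first have to be
        // combined with the site's base address.
        jokeimg.Source = new BitmapImage(new Uri(imageUrl, UriKind.Absolute));
    }
}
```

If the HAP callback turns out not to run on the UI thread, the call would need to be marshalled there, for example through the page's Dispatcher.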

That’s all folks!

 

Published at DZone with permission of Tim Dams, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)