Published: 27 Jun 2008
The road from web scraping, to geo-coded XML documents, to Google maps.
In the little world I know of, ninety-five percent of people use Google Maps, fifty percent of them whip up maps of their own by superimposing colorful dots (markers), shapes (lines and polygons) and media (images and videos) on a Google base map, and the rest wonder: how is it done? How did the guy yank housing data from Craigslist and create the very easy and useful home finder (http://www.housingmaps.com/)? How has Crime.org (now everyblock.com) captured crime data for major cities such as Chicago and New York and presented it in maps (not quite Google-map looking, though) community by community, block by block?
Carrying the current of Google mapping is the tide of web mashups. According to http://www.programmableweb.com/ (a website showcasing mashup applications), on average three new mashups spring up every day. The term originally referred to the practice of mixing and remixing two or more pieces of music to create a new song; in web programming, a mashup is a web application that combines data and services from more than one source into a single integrated tool (Wikipedia: mashup entry). The tide of mashups has washed over industries from mapping to entertainment, from social networks to shopping.
To me, the gold in mashups is data. As long as we have data, however we present it, the sky is the limit ("Data is the next Intel Inside," as Tim O'Reilly claims in his widely read article "What Is Web 2.0"). The biggest web contenders offer a number of mashup tools (for example, Intel Mash Maker) that let regular users create their own unique web products. However, data, or rather the lack of structured data, often becomes the stumbling block that keeps them from going very far. On the other hand, more and more internet companies have opened up their data stores and provided APIs so that developers can easily harness their data and services; marrying data queried from Yahoo to your own web application is now about as easy as querying your own database. Still, the vast majority of data is encased in all sorts of old-fashioned HTML tags and presented in jumbled styles, which makes it hard to reuse efficiently for new purposes or in mashups.
The secrets of how the guy mashed up real-estate data from Craigslist with the Google Maps API may be forever in his keep. However, I have racked my brain to create a mostly generic web application that scrapes and structures address-level data from a given website and then maps the data using the Google Maps API, and I think it can be easily modified to serve other mashup purposes. The following is the detailed implementation. These days, since everyone is trading their cars for bikes (for environmental, financial and health reasons), I chose a Chicago bike shop website (http://www.chicagobikeshops.info/) as my guinea pig. (View the resulting Google map (very raw) of Chicago bike shops)
Getting from address-level data dressed up in an HTML table to the final pushpin Google map takes three major steps:
- Scrape a given website, isolate the address data and write it into an XML document
- Feed the XML file to a geocoder
- Map the XML-formatted, geocoded data
Figure 1: From a webpage to a Google map
Web Scraping and Data Parsing
The WebScraping class is created for the purpose of web scraping and data parsing.
1. Download Data
Consuming a web page as a whole is easy. The System.Net.WebClient class provides functionality to upload data to or download data from the Internet, an intranet or a local file system. There are three methods for downloading data:
Listing 1: WebClient Downloading Data Method 1
Listing 2: WebClient Downloading Data Method 2
Listing 3: WebClient Downloading Data Method 3
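As a minimal sketch of three ways to download with WebClient: the variants below use DownloadString, DownloadData and OpenRead. Which three the listings above refer to is an assumption, and the class and method names (DownloadSketch, ViaDownloadString and so on) are hypothetical. UTF-8 content is assumed throughout.

```csharp
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper class for illustration; not the article's WebScraping class.
public static class DownloadSketch
{
    // Variant 1: DownloadString returns the whole resource as text.
    public static string ViaDownloadString(string url)
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.UTF8; // assumption: the page is UTF-8
            return client.DownloadString(url);
        }
    }

    // Variant 2: DownloadData returns raw bytes, which we decode ourselves.
    public static string ViaDownloadData(string url)
    {
        using (var client = new WebClient())
        {
            byte[] raw = client.DownloadData(url);
            return Encoding.UTF8.GetString(raw);
        }
    }

    // Variant 3: OpenRead exposes the response as a stream we read to the end.
    public static string ViaOpenRead(string url)
    {
        using (var client = new WebClient())
        using (Stream stream = client.OpenRead(url))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

All three accept an http:// address as well as a file:// URI, so a call such as DownloadSketch.ViaOpenRead("http://www.chicagobikeshops.info/") would return the page markup as one string.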
The WebRequest class can also send a request to a URL and use GetResponseStream to read the data, as follows:
Listing 4: WebRequest class for fetching data from a web page
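A minimal sketch of the WebRequest route, assuming the response body is text that can be read with a StreamReader. The class name WebRequestSketch and the method name Fetch are hypothetical.

```csharp
using System.IO;
using System.Net;

// Hypothetical helper for illustration only.
public static class WebRequestSketch
{
    // Sends a request to the URL and reads the whole response body as text.
    public static string Fetch(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Compared with WebClient, this route gives access to the request object before it is sent, which is where headers or a timeout would be set if a site required them.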
For the example, I use WebClient's OpenRead method. Before passing the whole page on for processing, I clean it up a bit, getting rid of annoyances such as the special character " and replacing every <br /> HTML tag.
Listing 5: Scrape a web page as a whole and do a bit of cleaning
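A hedged sketch of such a cleaning pass. The text does not say what the line-break tag is replaced with, so the replacement is left as a parameter here. The class name PageCleanerSketch and the decision to drop every double-quote character are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical cleaning helper; the article's own cleaning rules may differ.
public static class PageCleanerSketch
{
    // Removes double-quote characters and swaps <br>, <br/> and <br /> tags
    // (in any letter case) for the caller-supplied replacement text.
    public static string Clean(string html, string lineBreakReplacement)
    {
        string withoutQuotes = html.Replace("\"", string.Empty);

        // Escape '$' so Regex.Replace treats the replacement as literal text.
        string literal = lineBreakReplacement.Replace("$", "$$");

        return Regex.Replace(withoutQuotes, @"<br\s*/?>", literal, RegexOptions.IgnoreCase);
    }
}
```

For instance, PageCleanerSketch.Clean(page, " ") turns each line break into a single space, which keeps the two halves of a street address from running together.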
2. Parse Data
To comb a structure out of a structureless HTML page, people normally use very sophisticated (mind-numbingly complicated) regular expressions. There is also the old-fashioned way: plain string manipulation to weed out extraneous text. To me, it is daunting to write regular expressions that grab chunks of data spanning paragraphs and varied tags, but it is a no-brainer to use regular expressions for well-defined items such as URLs.
The process of parsing a web page is similar to peeling onions layer after layer, until we reach the core. Assume the address level data is presented in a table of rows and columns; each row contains a record, each cell a data field. First, we cut into the data table, then we extract all the rows, then we grab data from each cell, at which point we assemble and structure all data into an XML file.
In the WebScraping class, there are two similar methods that extract data using simple string functions. One (getDataBetweenTags) deals with a single data entry in a given text string; the other (getDataListBetweenTags) deals with multiple data entries.
On with the code:
Listing 6: Extract a single data item wedged between a pair of start and end tags in a string
Listing 7: Extract from a string a list of records (for example, table rows), where each record sits between a pair of start and end tags
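A minimal sketch of how two such methods could work with nothing more than IndexOf and Substring. The signatures, the case-insensitive matching and the choice to return null when a tag is missing are assumptions, not the article's actual code; the class name TagExtractionSketch is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical re-creation of the two string-based extraction methods.
public static class TagExtractionSketch
{
    // Returns the text between the first startTag and the endTag that follows it,
    // or null when either tag cannot be found.
    public static string GetDataBetweenTags(string text, string startTag, string endTag)
    {
        int start = text.IndexOf(startTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += startTag.Length;

        int end = text.IndexOf(endTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return text.Substring(start, end - start);
    }

    // Returns every chunk that sits between a startTag and the next endTag,
    // scanning the text from left to right.
    public static List<string> GetDataListBetweenTags(string text, string startTag, string endTag)
    {
        var items = new List<string>();
        int position = 0;
        while (true)
        {
            int start = text.IndexOf(startTag, position, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;
            start += startTag.Length;

            int end = text.IndexOf(endTag, start, StringComparison.OrdinalIgnoreCase);
            if (end < 0) break;

            items.Add(text.Substring(start, end - start));
            position = end + endTag.Length; // continue after this record
        }
        return items;
    }
}
```

Peeling the onion then becomes a chain of calls: GetDataBetweenTags(page, "<table>", "</table>") cuts out the table, GetDataListBetweenTags(table, "<tr>", "</tr>") yields the rows, and the same list method with "<td>" and "</td>" yields the cells of each row. Start tags that carry attributes would need the full opening tag text, or a preliminary clean-up, for this literal matching to work.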
I also use regular expressions to fine-tune some data fields. For example, in our case, the first table cell contains a URL that is encoded in the typical HTML format:
Listing 8: Parse hyperlink using regular expressions
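A hedged sketch of pulling the address and the link text out of an anchor tag with a regular expression. The pattern is one plausible choice, not the article's; it accepts quoted and unquoted href values. The names HyperlinkSketch and TryParse are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical hyperlink parser for a single table cell.
public static class HyperlinkSketch
{
    // Captures the href value as "url" and the anchor body as "text".
    private static readonly Regex Anchor = new Regex(
        @"<a\s+[^>]*?href\s*=\s*[""']?(?<url>[^""'\s>]+)[""']?[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns true and fills url/text when the cell contains an anchor tag.
    public static bool TryParse(string cellHtml, out string url, out string text)
    {
        Match match = Anchor.Match(cellHtml);
        url = match.Success ? match.Groups["url"].Value : null;
        text = match.Success ? match.Groups["text"].Value.Trim() : null;
        return match.Success;
    }
}
```

A cell such as `<a href="http://www.example.com/">Example Bikes</a>` would yield the URL http://www.example.com/ and the text Example Bikes; a cell without a link makes TryParse return false, so the caller can fall back to the raw cell text.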
Let’s see an example of how to use the above methods.
First, use the getDataBetweenTags method to locate the table, or whatever container the data chest is wrapped in:
Then proceed to grab the data rows:
At this point, I wish I could say I had found a one-size-fits-all method to parse rows, get the granular cell data and write it to a generic XML file. However, since all web pages are different (data points could be merged, spliced, omitted or repeated), I decided the best way was to deal with different pages more or less differently.
The following is how I turned my bike shop page into a structured XML document, with essential address information such as city and state added and URLs parsed.
Listing 9: Parse data item by item and write the data to an XmlDocument using XmlTextWriter
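A minimal sketch of writing parsed records with XmlTextWriter. The record type ShopRecord, the element names (shops, shop, name, url, street, city, state) and the idea of passing city and state in as constants are all assumptions for illustration; the article's actual schema is not shown.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical record holding the fields scraped from one table row.
public sealed class ShopRecord
{
    public string Name;
    public string Url;
    public string Street;
}

// Hypothetical writer that turns records into an XML string.
public static class ShopXmlSketch
{
    // City and state are supplied by the caller because the page is assumed
    // to list street addresses only.
    public static string ToXml(IEnumerable<ShopRecord> shops, string city, string state)
    {
        var buffer = new StringWriter();
        using (var writer = new XmlTextWriter(buffer))
        {
            writer.Formatting = Formatting.Indented;
            writer.WriteStartDocument();
            writer.WriteStartElement("shops");
            foreach (ShopRecord shop in shops)
            {
                writer.WriteStartElement("shop");
                writer.WriteElementString("name", shop.Name);
                writer.WriteElementString("url", shop.Url);
                writer.WriteElementString("street", shop.Street);
                writer.WriteElementString("city", city);
                writer.WriteElementString("state", state);
                writer.WriteEndElement(); // </shop>
            }
            writer.WriteEndElement(); // </shops>
            writer.WriteEndDocument();
        }
        return buffer.ToString();
    }
}
```

WriteElementString escapes characters such as & and <, so shop names scraped from HTML do not break the document. Writing to a file instead of a string only requires constructing the XmlTextWriter with a file path and an encoding.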
Geocoding an XML file
Quoting Wikipedia, "Geocoding is the process of assigning geographic identifiers (e.g., codes or geographic coordinates expressed as latitude-longitude) to map features and other data records, such as street addresses. You can also geocode media, for example where a picture was taken, IP addresses, and anything that has a geographic component. With geographic coordinates the features can be mapped and entered into Geographic Information Systems."
With the popularity of the various mapping applications has come a number of APIs for online geocoding. A search turns up the Yahoo Geocoding API, the Google Maps API, Geocoder.us and others. I found the Yahoo Geocoding RESTful web service (REST stands for Representational State Transfer, basically meaning that you can use GET to query and get a result set back in XML) easier to handle and to use as an independent component. The Yahoo API is also more generous with developers: it allows 50,000 queries per day per IP address (in comparison, Google allows 50,000 queries per developer per day).
50,000 queries per IP address per day sounds like a lot if you only want to map where your friends live. Sometimes, however, you have a table of over a thousand records, and you will quickly exhaust your query allowance. If thousands of addresses were geocoded again on every page request, each request would further cripple your web application. That is why you should always try to avoid geocoding and re-geocoding unnecessarily.
Even without the query constraints, geocoding is still a costly process in terms of both time and resources. It is absolutely hideous to overlook this issue and request online geocoding over and over again. If you can do geocoding offline, by all means do it offline. However, offline geocoding breaks the streamlined process of mashing up two or more real-time websites (plus, I do not want to do anything offline), so my strategy is to scrape the target website for addresses and geocode them once per day, then store the geocoded result in an XML file.
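One way to enforce a once-per-day policy is to look at the age of the stored XML file before scraping and geocoding again. The sketch below assumes the file's last-write time is a good enough freshness signal; the class name GeocodeCacheSketch, the method name NeedsRefresh and the file name used in the usage note are hypothetical.

```csharp
using System;
using System.IO;

// Hypothetical freshness check for the stored, geocoded XML file.
public static class GeocodeCacheSketch
{
    // Returns true when the cache file is missing or older than maxAge.
    // The current time is passed in so the rule is easy to test.
    public static bool NeedsRefresh(string cachePath, TimeSpan maxAge, DateTime utcNow)
    {
        if (!File.Exists(cachePath))
        {
            return true;
        }
        TimeSpan age = utcNow - File.GetLastWriteTimeUtc(cachePath);
        return age > maxAge;
    }
}
```

A page would call GeocodeCacheSketch.NeedsRefresh("bikeshops_geocoded.xml", TimeSpan.FromDays(1), DateTime.UtcNow) and only scrape and geocode when the answer is true; otherwise it serves the stored file as it is.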
On with Geocoding and the Yahoo API.
My geocoding class is stolen from the YahooGeocoding class presented in the article Mash-it Up with ASP.NET AJAX: Using a proxy to access remote APIs (Thanks!), although I made some serious modifications and additions and produce a completely different output.
The YahooGeocoding helper class has two overloaded functions named Geocode. One handles a single-address geocoding request and outputs one latitude and longitude pair; the other geocodes multiple addresses fed in as an XmlDocument and returns another XmlDocument with latitude and longitude appended as new elements to each address node.
Both functions use standard XML parsing techniques. The single-address geocoding function uses the forward-only, read-only XmlTextReader (copied from the above-mentioned Mash-it-up article). I personally prefer XmlDocument for traversing XML, so the batch-geocoding function takes three parameters: an input XML file path, an output XML file path, and an XPath string that denotes where the address branch resides. After creating an XmlDocument from the XML file, it walks through the list of address nodes and passes each one off to the single-address geocoding function, while an XmlTextWriter opens a new XML document at the specified path and writes all the original XML nodes together with the latitude and longitude for each address.
Listing 10: Batch Geocoding function
Listing 11: Single Address Geocoding Class
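A hedged sketch of the batch step: select address nodes with XPath, geocode each one, and append latitude and longitude elements. To keep it self-contained, the single-address lookup is passed in as a delegate instead of calling a real web service, and the document is modified in memory instead of being rewritten with XmlTextWriter; both are simplifications, and the names BatchGeocodeSketch and AppendCoordinates are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

// Hypothetical batch geocoder working on an in-memory XmlDocument.
public static class BatchGeocodeSketch
{
    // geocodeOne stands in for the single-address lookup. It receives a
    // comma-separated address and returns { latitude, longitude }, or null.
    // Returns the number of address nodes that received coordinates.
    public static int AppendCoordinates(XmlDocument doc, string addressXPath,
                                        Func<string, double[]> geocodeOne)
    {
        // Copy the matches first so the document can be changed safely below.
        var addressNodes = new List<XmlNode>();
        foreach (XmlNode node in doc.SelectNodes(addressXPath))
        {
            addressNodes.Add(node);
        }

        int geocoded = 0;
        foreach (XmlNode address in addressNodes)
        {
            // Build the query from the child elements (street, city, state...).
            var parts = new List<string>();
            foreach (XmlNode child in address.ChildNodes)
            {
                if (child.NodeType == XmlNodeType.Element)
                {
                    parts.Add(child.InnerText.Trim());
                }
            }

            double[] latLng = geocodeOne(string.Join(", ", parts.ToArray()));
            if (latLng == null || latLng.Length != 2)
            {
                continue; // leave unmatched addresses untouched
            }

            AppendElement(address, "latitude", latLng[0]);
            AppendElement(address, "longitude", latLng[1]);
            geocoded++;
        }
        return geocoded;
    }

    private static void AppendElement(XmlNode parent, string name, double value)
    {
        XmlElement element = parent.OwnerDocument.CreateElement(name);
        element.InnerText = value.ToString(CultureInfo.InvariantCulture);
        parent.AppendChild(element);
    }
}
```

Passing the lookup as a delegate also makes it easy to swap geocoding services, or to plug in a canned lookup while testing, without touching the XML handling.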
Time for mapping
Once we have the geocoded data, we can call the greedy GDownloadUrl method provided by the Google Maps API. I call it greedy because GDownloadUrl is not discriminating: it consumes a text file, an XML document, or any server page such as a php, asp or aspx page. It is a wrapper function that takes care of the core AJAX concept of making an XmlHttpRequest call to a remote server for data sharing and exchange.
Making a pushpin-type map (some of my colleagues are professional mappers, and they call this dropping dots) is very easy, and the first step always follows the same formula:
Listing 12: Google Map initialization and declarations
Since this is a generic map application, we have no preset geographic boundary, level of detail or number of records. We would like to set the map boundary and center based on the available records, so we do the following:
At this point, we are set to fetch the data, create our pushpin markers, and superimpose them onto a Google map, as in the following code:
Listing 13: Process data, create markers and add them to the map
A couple of points to note:
1. Map boundary, center and zoom level
The map boundary is dynamically set and expanded whenever a map point is thrown into the picture.
Then we can set the zoom level and map center after finishing processing all of the points:
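The bookkeeping behind a growing boundary can be illustrated independently of the Maps API, which does this work in JavaScript on the page. The sketch below is written in C# to match the rest of the code here and is not Google Maps code; the class name BoundsSketch and its members are hypothetical. It keeps the smallest box that contains every point seen so far and derives the center once all points are in. It assumes the points do not straddle the 180° meridian.

```csharp
using System;

// Hypothetical bounding box that grows as points are added.
public sealed class BoundsSketch
{
    private bool hasPoint;

    public double South { get; private set; }
    public double North { get; private set; }
    public double West { get; private set; }
    public double East { get; private set; }

    // Expands the box just enough to include the given point.
    public void Extend(double latitude, double longitude)
    {
        if (!hasPoint)
        {
            South = North = latitude;
            West = East = longitude;
            hasPoint = true;
            return;
        }
        South = Math.Min(South, latitude);
        North = Math.Max(North, latitude);
        West = Math.Min(West, longitude);
        East = Math.Max(East, longitude);
    }

    // Midpoint of the box; meaningful only after at least one Extend call.
    public double CenterLatitude { get { return (South + North) / 2; } }
    public double CenterLongitude { get { return (West + East) / 2; } }
}
```

The zoom level then follows from the size of the box: the wider the spread between the corners, the further out the map has to zoom to show every marker.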
2. Markers can be as simple as a dropped red tear (the default Google Maps icon) or as fancy, dynamic and fluid as your imagination allows. The example uses the open-source labelMaker.js for the purpose, simply because the author likes to try new things, although it is used here in the most uncreative way. You are encouraged to explore all the possibilities out there.
In an effort to explore a fairly generic path from a given web page, to a structured XML data file, to a final mashup with a map service, I created two classes for web scraping, data parsing and multiple-address geocoding using the Yahoo API, and presented an example of how to use the classes and map the data. In many ways it may look crude and clumsy; however, I present it here as an invitation for better, more elegant and more efficient solutions to harvest the data of our virtual world: the wild wide web.
Web developer, Data Analyst, GIS Programmer
This author has published 12 articles on DotNetSlackers.