How do you get information from a website into an Excel spreadsheet? The answer is screenscraping. There are a number of tools and platforms (such as OutWit Hub, Google Docs and Scraper Wiki) that help you do this, but none of them is, in my opinion, as easy to use as the Google Chrome extension Scraper, which has become one of my absolute favourite data tools.
What is a screenscraper?
I like to think of a screenscraper as a small robot that reads websites and extracts pieces of information. When you are able to unleash a scraper on hundreds, thousands or even more pages, it can be an incredibly powerful tool.
In its simplest form, the one that we will look at in this blog post, it gathers information from one webpage only.
Google Chrome’s Scraper
Scraper is a Google Chrome extension that can be installed for free from the Chrome Web Store.
If the extension installed correctly, you should see the option “Scrape similar” when you right-click any element on a webpage.
The Task: Scraping the contact details of all Swedish MPs
This is the site we’ll be working with, a list of all Swedish MPs, including their contact details. Start by right-clicking the name of any person and choose Scrape similar. This should open the following window.
Understanding XPaths
At w3schools you’ll find a broader introduction to XPaths.
Before we move on to the actual scrape, let me briefly introduce XPaths. XPath is a language for finding information in an XML structure, for example an HTML file. It is a way to select tags (or rather “nodes”) of interest. In this case we use XPaths to define which parts of the webpage we want to collect.
A typical XPath might look something like this:
//div[@id="content"]/table[1]/tr
Which in plain English translates to:
// - Search the whole document...
div[@id="content"] - ...for the div tag with the id "content".
table[1] - Select the first table.
tr - And in that table, grab all rows.
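If you want to try this XPath outside the browser, here is a minimal sketch in Python. It is not part of the Scraper extension. It uses the standard library's xml.etree.ElementTree, which only supports a subset of XPath and only reads well-formed markup, and the tiny page in it is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div id="menu"><table><tr><td>ignored</td></tr></table></div>
    <div id="content">
      <table>
        <tr><td>row 1</td></tr>
        <tr><td>row 2</td></tr>
      </table>
      <table>
        <tr><td>second table, not selected</td></tr>
      </table>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ElementTree searches relative to an element, so the path starts with "."
# instead of the bare "//" that Scraper accepts.
rows = root.findall('.//div[@id="content"]/table[1]/tr')

# Read the text of the single cell in each matched row.
cells = [row.find("td").text for row in rows]
print(cells)  # ['row 1', 'row 2']
```

Only the rows of the first table inside the div with the id "content" come back, which matches the plain-English reading above.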
Over to Scraper then. I’m given the following suggested XPath:
//section[1]/div/div/div/dl/dt/a
The results look pretty good, but it seems we only get names starting with an A. And we would also like to collect phone numbers and party names. So let’s go back to the webpage and look at the HTML structure.
Right-click one of the MPs and choose Inspect element. We can see that each alphabetical list is contained in a section tag with the class “grid_6 alpha omega searchresult container clist”.
And if we open the section tag we find the list of MPs in div tags.
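To see why the suggested XPath stops at the letter A, here is a small sketch in Python, again with the standard library's xml.etree.ElementTree. The markup is a simplified guess at the structure described above, and the two names are invented; the real page will differ in its details.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up stand-in for the page: one section per letter.
PAGE = """
<body>
  <section class="grid_6 alpha omega searchresult container clist">
    <div><div><div>
      <dl><dt><a>Andersson, Anna</a></dt></dl>
    </div></div></div>
  </section>
  <section class="grid_6 alpha omega searchresult container clist">
    <div><div><div>
      <dl><dt><a>Berg, Bo</a></dt></dl>
    </div></div></div>
  </section>
</body>
""".strip()

root = ET.fromstring(PAGE)

# section[1] picks only the first section, so only its names are returned.
first_only = [a.text for a in root.findall(".//section[1]/div/div/div/dl/dt/a")]

# Matching on the class attribute picks every section that carries that class.
SECTION = 'section[@class="grid_6 alpha omega searchresult container clist"]'
all_names = [a.text for a in root.findall(".//" + SECTION + "/div/div/div/dl/dt/a")]

print(first_only)  # ['Andersson, Anna']
print(all_names)   # ['Andersson, Anna', 'Berg, Bo']
```

Note that the class predicate is an exact string match: it only works if the attribute value is written exactly as it appears in the page.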
We will do this scrape in two steps. Step one is to select the tags containing all information about the MPs with one XPath. Step two is to pick the specific pieces of data that we are interested in (name, e-mail, phone number, party) and place them in columns.
Writing our XPaths
In step one we want to get as deep into the HTML structure as possible without losing any of the elements we are interested in. Hover over the tags in the Elements window to see which tags correspond to which elements on the page.
In our case this is the last tag that contains all the data we are looking for:
//section[@class="grid_6 alpha omega searchresult container clist"]/div/div/div/dl
Click Scrape to test run the XPath. It should give you a list that looks something like this.
Scroll down the list to make sure it has 349 rows. That is the number of MPs in the Swedish parliament. The second step is to split this data into columns. Go back to the webpage and inspect the HTML code.
I have highlighted the parts that we want to extract. Grab them with the following XPaths:
name: dt/a
party: dd[1]
region: dd[2]/span[1]
seat: dd[2]/span[2]
phone: dd[3]
e-mail: dd[4]/span/a
Insert these paths in the Columns field and click Scrape to run the scraper.
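The same two-step idea can be sketched in Python: one XPath selects a node per MP, and the column XPaths are then evaluated relative to each of those nodes. This is only an illustration of the technique, not how the extension itself is implemented. The sample markup, the person and the contact details are all made up, and xml.etree.ElementTree needs well-formed input, so a real page would probably require a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up markup that follows the structure described in this post.
PAGE = """
<body>
  <section class="grid_6 alpha omega searchresult container clist">
    <div><div><div>
      <dl>
        <dt><a>Andersson, Anna</a></dt>
        <dd>Party X</dd>
        <dd><span>Region Y</span><span>Seat 12</span></dd>
        <dd>08-000 00 00</dd>
        <dd><span><a>anna.andersson@example.org</a></span></dd>
      </dl>
    </div></div></div>
  </section>
</body>
""".strip()

# Step one: the XPath that selects one node per MP.
ROW_XPATH = (
    './/section[@class="grid_6 alpha omega searchresult container clist"]'
    "/div/div/div/dl"
)

# Step two: column XPaths, evaluated relative to each row node.
COLUMNS = {
    "name": "dt/a",
    "party": "dd[1]",
    "region": "dd[2]/span[1]",
    "seat": "dd[2]/span[2]",
    "phone": "dd[3]",
    "e-mail": "dd[4]/span/a",
}


def scrape(page):
    """Return one dict per row node, with one key per column."""
    root = ET.fromstring(page)
    rows = []
    for node in root.findall(ROW_XPATH):
        row = {}
        for column, xpath in COLUMNS.items():
            match = node.find(xpath)
            # A missing element becomes an empty cell instead of an error.
            row[column] = (match.text or "").strip() if match is not None else ""
        rows.append(row)
    return rows


rows = scrape(PAGE)
print(len(rows))        # 1
print(rows[0]["name"])  # Andersson, Anna
print(rows[0]["seat"])  # Seat 12
```

On the real page you would expect len(rows) to be 349, the same sanity check as scrolling down the list in Scraper.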
Click Export to Google Docs to get the data into a spreadsheet.
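If you have scraped rows in a script instead of in the extension, a CSV file is one simple way to get them into Excel or Google Docs. The sketch below is an alternative to the Export button, not a description of what it does. It uses Python's standard csv module, and the rows and the helper name rows_to_csv are made up for illustration.

```python
import csv
import io

# Made-up rows standing in for scraped data.
rows = [
    {"name": "Andersson, Anna", "party": "Party X", "phone": "08-000 00 00"},
    {"name": "Berg, Bo", "party": "Party Z", "phone": "08-111 11 11"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows, ["name", "party", "phone"])
print(csv_text)

# To save it as a file that a spreadsheet program can open:
# with open("mps.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(csv_text)
```

The csv module puts quotes around names that contain a comma, so "Andersson, Anna" stays in a single cell when the file is opened.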
Source: http://dataist.wordpress.com/2012/10/12/get-started-with-screenscraping-using-google-chromes-scraper-extension/