OutWit Hub is a Web browser built specifically for extracting data from webpages. This is very useful for enterprising reporters!
Government websites are now orchards full of apples for the picking. Let’s try OutWit out on the website of the Disciplinary Board of the Supreme Court of Pennsylvania.
The site shows a table of recent actions taken by the state Supreme Court against lawyers. The table lists each attorney’s name, county and the action taken against him or her (such as suspension or disbarment):
Here’s how the webpage looks in OutWit Hub:
OutWit has an options tree on the left. See “data” on the tree? Look underneath it and click “tables”.
Voila! OutWit has extracted the data from the <table> for you:
We chose “tables” because the table of recent court actions is, well, an HTML <table>. But if the data were contained in an HTML list (i.e. <ul> or <ol>) rather than a <table>, then we would choose “lists” instead.
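For readers curious about what the "tables" versus "lists" distinction means in the page's HTML, here is a minimal Python sketch using only the standard library. It is not how OutWit Hub works internally (the tool's implementation isn't covered here); it simply shows that table data lives in <tr>/<td> tags while list data lives in <li> tags. The class name, the function name and the sample HTML (including the attorney name and county) are made up for illustration.

```python
from html.parser import HTMLParser


class TableAndListParser(HTMLParser):
    """Collect <table> rows and <ul>/<ol> items from an HTML string."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per <tr>
        self.items = []    # one string per <li>
        self._row = None   # cells of the row being read, if any
        self._text = None  # text fragments of the current cell or item

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th", "li"):
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell or a list item.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None and self._text is not None:
            self._row.append("".join(self._text).strip())
            self._text = None
        elif tag == "li" and self._text is not None:
            self.items.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract(html):
    """Return (table_rows, list_items) found in the given HTML."""
    parser = TableAndListParser()
    parser.feed(html)
    parser.close()
    return parser.rows, parser.items


# Made-up sample page: the names and values are placeholders, not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>County</th><th>Action</th></tr>
  <tr><td>Jane Doe</td><td>Example County</td><td>Suspension</td></tr>
</table>
<ul><li>First item</li><li>Second item</li></ul>
"""

rows, items = extract(SAMPLE_HTML)
print(rows)
print(items)
```

This sketch ignores nested tables and nested lists, which real pages sometimes contain.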
Either way, you can now select all the data by pressing Ctrl+A. If you don’t want everything, hold down Shift and click just the rows that you want.
After making your selection, click “catch”:
Now it’s time to sort and filter your data. Click the icon that looks like a miniature spreadsheet with a triangle:
Uncheck the data columns that you don’t want:
Finally, click “export,” then choose a file format for your data:
OutWit can save data in several popular file formats. Here are some samples: CSV, TXT, HTML, JSON.
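To give a feel for what dropping unwanted columns and exporting amounts to, here is a rough Python sketch that keeps two columns and renders the same rows as CSV and as JSON. It uses only the standard library and does not reproduce OutWit's actual output; the function names and the sample rows are hypothetical.

```python
import csv
import io
import json


def keep_columns(rows, wanted):
    """Return one dict per data row with only the wanted columns.

    rows[0] is treated as the header row.
    """
    header, body = rows[0], rows[1:]
    return [{col: row[header.index(col)] for col in wanted} for row in body]


def to_csv(records, columns):
    """Render the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Render the records as indented JSON text."""
    return json.dumps(records, indent=2)


# Made-up rows standing in for caught data; not real court records.
ROWS = [
    ["Name", "County", "Action"],
    ["Jane Doe", "Example County", "Suspension"],
    ["John Roe", "Sample County", "Disbarment"],
]

records = keep_columns(ROWS, ["Name", "Action"])
csv_text = to_csv(records, ["Name", "Action"])
json_text = to_json(records)
print(csv_text)
print(json_text)
```

CSV is handy for opening the data in a spreadsheet, while JSON keeps the column names attached to every value, which suits further processing in code.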
Other uses for OutWit Hub
The free version of OutWit Hub limits you to extracting 100 rows of data. That’s the version I use, and the limit hasn’t been a hindrance.
OutWit can help non-reporters, too! Sports fans can scrape their favorite teams’ scores off a website. History buffs can take China’s timeline off Wikipedia.
What websites would you like to scrape? Got suggestions? Please comment below!