OutWit Hub is a Web browser built specifically for extracting data from webpages. That makes it very useful for enterprising reporters!

Government websites are now orchards full of apples for the picking. Let’s try OutWit out on the website of the Disciplinary Board of the Supreme Court of Pennsylvania.

We see a table of recent actions taken by the state Supreme Court against lawyers. The table lists each attorney’s name, county and the action taken against him or her (a suspension or disbarment, for example):

The Disciplinary Board of the Supreme Court of Pennsylvania

Here’s how the webpage looks in OutWit Hub:

Screenshot from OutWit Hub

OutWit has an options tree on the left. See “data” on the tree? Look underneath it and click “tables.”

Voilà! OutWit has extracted the data from the <table> for you:

OutWit Hub has a tree diagram of options on your left. Find “data” and, underneath it, click on “tables.”

We chose “tables” because the table of recent court actions is, well, an HTML <table>. But if the data were contained in an HTML list (i.e., <ul> or <ol>) rather than a <table>, we would choose “lists” instead.
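If you’re curious what the difference between a <table> and a list looks like to a program, here is a minimal Python sketch using only the standard library. It is not how OutWit Hub works internally; the class name TableAndListParser and the sample HTML (including the attorney names) are made up for illustration.

```python
from html.parser import HTMLParser


class TableAndListParser(HTMLParser):
    """Hypothetical helper: collect <table> rows and <ul>/<ol> items."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per <tr>
        self.items = []    # one string per <li>
        self._row = None   # cells of the <tr> currently being read
        self._text = None  # text pieces of the current cell or list item

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th", "li"):
            self._text = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell or a list item.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            if self._row is not None:
                self._row.append("".join(self._text).strip())
            self._text = None
        elif tag == "li" and self._text is not None:
            self.items.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample data, shaped like the table described above.
SAMPLE_HTML = """
<table>
  <tr><th>Attorney</th><th>County</th><th>Action</th></tr>
  <tr><td>Jane Doe</td><td>Allegheny</td><td>Suspension</td></tr>
  <tr><td>John Roe</td><td>Philadelphia</td><td>Disbarment</td></tr>
</table>
<ul><li>First item</li><li>Second item</li></ul>
"""

parser = TableAndListParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)   # data that lives in the <table>
print(parser.items)  # data that lives in the <ul>
```

The table data comes out as rows of cells, while the list data comes out as a flat series of items, which is roughly why a scraper treats “tables” and “lists” as separate choices.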

Either way, you can now select all the data by pressing Ctrl-A. If you don’t want everything, hold down Shift and click just the rows that you want.

After making your selection, click “catch”:

Screenshot from OutWit Hub (outwit_hub_03)

Now it’s time to sort and filter your data. Click the icon that looks like a miniature spreadsheet with a triangle:

Screenshot from OutWit Hub (outwit_hub_04)

Uncheck the data columns that you don’t want:

Screenshot from OutWit Hub (outwit_hub_05)

Finally, click “export,” then choose a file format for your data:

Screenshot from OutWit Hub (outwit_hub_06)

OutWit can save data in several popular file formats, including CSV, TXT, HTML and JSON.
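To get a feel for how two of those formats differ, here is a rough Python sketch that turns the same rows into CSV text and JSON text. It only illustrates the formats themselves, not OutWit’s actual export output; the function names rows_to_csv and rows_to_json and the sample rows are invented for this example.

```python
import csv
import io
import json

# Invented sample data: a header plus two rows.
HEADER = ["Attorney", "County", "Action"]
ROWS = [
    ["Jane Doe", "Allegheny", "Suspension"],
    ["John Roe", "Philadelphia", "Disbarment"],
]


def rows_to_csv(header, rows):
    """Return the header and rows as CSV text, one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_json(header, rows):
    """Return the rows as a JSON array of objects keyed by column name."""
    records = [dict(zip(header, row)) for row in rows]
    return json.dumps(records, indent=2)


print(rows_to_csv(HEADER, ROWS))
print(rows_to_json(HEADER, ROWS))
```

CSV is the easier choice if the data is headed for a spreadsheet. JSON repeats the column name with every value, which tends to suit web applications better.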

Other uses for OutWit Hub

The free version of OutWit Hub limits you to extracting 100 rows of data. That’s the version I use, and so far the limit hasn’t been a hindrance.

OutWit can help non-reporters, too! Sports fans can scrape their favorite teams’ scores off a website. History buffs can take China’s timeline off Wikipedia.

What websites would you like to scrape? Got suggestions? Please comment below!