Scraping Web Data with RapidMiner

After my last post about the characteristics of Bundesliga players’ body data by position, I have been asked whether there is a relationship between the height of players or teams and their tactics on the field. For example, is a team with taller forwards more likely to make use of crosses and headers to score? While I’m still working on this topic, I thought it would be nice to show how I built a dataset of all Bundesliga goals in the past season to answer this question.

So here is a short introduction to scraping web data with RapidMiner. My goal: build a dataset of all goals of the last Bundesliga season, including additional information such as the kind of assist that preceded each goal. A good data source is Transfermarkt.de, which offers a game sheet for every match.

For a few matches, the relevant data can be extracted by hand. The problem arises when you plan to collect data for a whole season. So here is how I did it, step by step.

Preparing a list of websites to scrape

  1. The first thing to do is to build up a collection of pages that contain the information you’re looking for. Transfermarkt offers a season overview containing all matches and links to their respective game sheets.
  2. The next step is to view the source code of the page which contains all the links.
  3. Copy the HTML code to Excel or any other spreadsheet application.
  4. You’ll realize that you’ve copied thousands of lines of code. In this case only 306 of them, the total number of matches per season, are of interest. A good way to separate the lines containing valuable information is to sort the whole table: since the relevant lines share a unique structure, they will be concentrated in one section, while all other lines can be deleted.
  5. When only the relevant lines of code are left, the next step is to separate the relevant links from the remaining HTML structure. A good way to do this is to use quotation marks as separators.
  6. Having deleted all irrelevant columns, the last step is to add the domain name common to all links in front of them, in this case „www.transfermarkt.de/“. In Excel you can merge strings by using „&“. The result is a list of all pages to scrape, which can be used in RapidMiner. (For readers who prefer code, a short Python sketch of these steps follows this list.)
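
If you would rather script this preparation, the following Python sketch does roughly the same as the spreadsheet steps above. It is only an illustration under stated assumptions: the file names are placeholders, and the „spielbericht“ pattern is my guess at how Transfermarkt marks game-sheet links, so check the actual page source before relying on it.

```python
import re

# Read the saved source code of the season overview page
# (the file name is an assumption; save the page source locally first).
with open("season_overview.html", encoding="utf-8") as f:
    html = f.read()

# Collect every href that looks like a game-sheet link. The
# "spielbericht" pattern is an assumption about Transfermarkt's URLs.
links = re.findall(r'href="([^"]*spielbericht[^"]*)"', html)

# Prepend the common domain, mirroring the „&" concatenation in Excel.
urls = ["https://www.transfermarkt.de" + link for link in links]

# One URL per line, ready to paste into the link spreadsheet.
with open("links.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(urls))

print(len(urls), "links extracted")  # should be 306 for a full season
```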

Scraping with RapidMiner

From here on I assume that you have a basic understanding of how RapidMiner works and how processes can be designed, so I won’t start with the absolute basics. If you haven’t already, you should now install RapidMiner along with the newest version of the web mining extension.

At the end, your main process will be built from the following steps:

  1. After opening a new project in RapidMiner, the first thing to do is to make use of the „Read Excel“ operator. It will read the link spreadsheet line by line and submit the websites to the following operator. All operators can be searched in the operators section on the left side. In the parameters section on the right you only have to provide the path to your file and state whether the first row contains headers. The import wizard provided should be useful. Important: make sure the attribute of the variable containing the links is set to „file path“.
  2. Next you have to add the „Get Pages“ operator to your process and connect it to the „Read Excel“ operator. The only thing to do here is to define the name of the column which contains the links in the spreadsheet.
  3. The third operator to use is „Data to Documents“. Connect it to the output of „Get Pages“.
  4. Connect the document output to the „Process Documents“ operator. It is important that the keep text option is checked. Otherwise there will be no text to extract the data from.
  5. Double click on the „Process Documents“ operator. You are now at a lower level where you can set up the nested operators that will run inside „Process Documents“. Combined, they perform the following steps:
  6. The first operator in here is „Extract Content“, which separates text from HTML tags. The minimum text block length defines how long the extracted text blocks (tokens) have to be at least. You should set the length depending on the content you want to extract. If you set it to one, all text will be extracted.
  7. This step is optional. You can use the „Filter Tokens“ operator to decide which pieces of text you want to keep, by giving the operator a string to filter by. By default, tokens containing the defined string are kept, but you can also invert the filter by selecting the checkbox in the parameters section. I used this to exclude all social media text like „share on twitter“ etc., but this can be done later as well.
  8. The next step is optional, too. With the „Cut Document“ operator you can cut your text into parts by defining strings as starting and ending sequences. Using this operator makes sense if you are only interested in a particular part of the text. If a text is well structured, like the game sheets on Transfermarkt.de, every type of information will be found in its own section. For example, all information about a match’s goals sits between its own header „Tore“ („goals“) and „Wechsel“ („substitutions“), the header of the next block of information. If you apply this operator, only text between the matching strings will be kept. It is possible to define a great number of matching strings; in the resulting dataset, each extracted section can be identified by the name you labeled it with.
  9. Now you can return to the main process. The final step here is to connect the output of the „Process Documents“ operator to the „Write Excel“ operator. Select a directory and a document type, and RapidMiner will write your dataset to an Excel file. (A compact Python sketch of the whole pipeline follows below.)
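
For comparison, here is a minimal Python sketch of the whole pipeline, using the requests and pandas libraries and the same assumptions as above: the links sit in a column named „url“ in a file links.xlsx, and the goal information of each game sheet sits between the headers „Tore“ and „Wechsel“. The crude tag stripping only stands in for „Extract Content“, which is much smarter about separating real text from boilerplate.

```python
import re

import pandas as pd  # needs openpyxl installed for Excel I/O
import requests

# "Read Excel": one game-sheet URL per row; the column name is an assumption.
links = pd.read_excel("links.xlsx")

rows = []
for url in links["url"]:
    # "Get Pages": download the game sheet.
    page = requests.get(url).text

    # "Extract Content" (crudely): strip tags, collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", page)
    text = re.sub(r"\s+", " ", text)

    # "Cut Document": keep only the section between the two headers.
    match = re.search(r"Tore(.*?)Wechsel", text)
    rows.append({"url": url, "goals": match.group(1).strip() if match else ""})

# "Write Excel": one row per page, with the extracted goal section.
pd.DataFrame(rows).to_excel("goals_raw.xlsx", index=False)
```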

Data Jiu-Jitsu

The rest of the work can be done in Excel again.

  1. Depending on how many sections you cut from the text, your dataset will contain one row per section and page, i.e. the number of cut sections times the number of pages you scraped. By sorting the spreadsheet by the label (query key attribute) assigned to each section, you can easily select the ones you want and copy them to another table.
  2. The last step in creating your dataset is data jiu-jitsu, and everybody handles that differently. In my case, I had to think a while before I found a good way to get my data into shape, because there were no separators in the text to distinguish goal events. Finally, I prefixed every score with a comma, so that, for example, „0:1“ became „,0:1“. Adding a comma in front of every „digit:“ combination took less than a minute (see the sketch below). The rest is a lot of reshaping.
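
As a quick illustration, here is the same comma trick as a Python regex, applied to a hypothetical line of extracted goal text (the player names and minutes are made up):

```python
import re

# Hypothetical raw goal text; in reality one long string per match.
raw = "0:1 Mueller 12' 1:1 Meier 47' 2:1 Schmidt 88'"

# Insert a comma before every "digit:digit" score, then split on it.
events = re.sub(r"(\d+:\d+)", r",\1", raw).split(",")
events = [e.strip() for e in events if e.strip()]

print(events)  # ["0:1 Mueller 12'", "1:1 Meier 47'", "2:1 Schmidt 88'"]
```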

I hope this tutorial will be useful to somebody. If there are further questions, don’t hesitate to ask them in the comments section.

by Tobias Wolfanger