Screen Scraping HTML with PHP and Regular Expressions
I have had a lot of success developing several website scraping scripts that parse pages by matching regular expressions. Scraping as a method of obtaining data is always a bad idea: whatever system you're scraping obviously isn't supporting your efforts in any way. So even assuming the process isn't copyright infringement or otherwise illegal, the code will inevitably require a lot of maintenance. That being said, let's scrape.
This is the HTML for a link that is the result of a Google search:
<a class="l" onmousedown="return clk(this.href,'res','1','')" href="http://www.microsoft.com/"><b>Microsoft</b> Corporation</a>
Now say, for instance, we want to scrape all the href attributes and the display text of each link. I believe Google has a web service that would be the correct way to access this data, but that is just a reminder of why you should never do this. I have gone about the parsing by using tags as tokens to look for while ignoring their other attributes in matches. In most instances attributes carry more specific display information about a page, while the tags outline the structure of the page and should be subject to less change. With that in mind, the following is a regular expression that will match any anchor with an href, regardless of whether it has class or onmousedown attributes.
/<a [^>]*?href="(?P<HREF>[^"]*)"[^>]*>(?P<ANCHOR_TEXT>.*?)<\/a>/s
The most reliable way I have found to skip over the attributes at the start of a tag is the expression [^>]*?. It allows any text other than the closing greater-than sign to match there, i.e. the first two attributes I am ignoring. The question mark makes the quantifier non-greedy, so it consumes as little as possible; the same goes for .*? in the ANCHOR_TEXT group, which stops at the first closing </a>. Doing this for a single-node expression like this one prevents some spurious matches. The two named groups, HREF and ANCHOR_TEXT, capture the link target and the link's display text. The final noteworthy aspect of this snippet is the s after the closing delimiter, which lets the dot match newlines so that anchor text spanning multiple lines still matches. (The m modifier, by contrast, only changes where ^ and $ match.)
Here is how this might look in PHP:
$pattern = '/<a [^>]*?href="(?P<HREF>[^"]*)"[^>]*>(?P<ANCHOR_TEXT>.*?)<\/a>/s';
preg_match_all($pattern, $myHtml, $result, PREG_SET_ORDER);
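To make the shape of the result concrete, here is a minimal sketch of walking the matches. It assumes a pattern along the lines described above, with a hypothetical HREF group name for the link target; the second link in the sample input and the $links array are made up for illustration and are not part of the original scripts.

```php
<?php
// Sample input: the Google result link from the text, followed by a
// second, made-up link on the same line.
$myHtml = <<<'HTML'
<a class="l" onmousedown="return clk(this.href,'res','1','')" href="http://www.microsoft.com/"><b>Microsoft</b> Corporation</a> <a href="http://www.example.com/">Example</a>
HTML;

// HREF is an assumed group name; ANCHOR_TEXT is the name used in the text.
$pattern = '/<a [^>]*?href="(?P<HREF>[^"]*)"[^>]*>(?P<ANCHOR_TEXT>.*?)<\/a>/s';

$links = [];
if (preg_match_all($pattern, $myHtml, $result, PREG_SET_ORDER)) {
    // With PREG_SET_ORDER, each element of $result is one complete match,
    // indexed both by group number and by group name.
    foreach ($result as $match) {
        $links[] = [
            'href' => $match['HREF'],
            // Drop inline markup such as <b> from the display text.
            'text' => strip_tags($match['ANCHOR_TEXT']),
        ];
    }
}

print_r($links);
```

Because both quantifiers are non-greedy, the two links on the same line come back as two separate matches rather than one match running from the first <a to the last </a>.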
The parsers I've written have been running for months with little maintenance. I think I owe that to extending this method to match larger portions of HTML with a single regular expression. This helps ensure the validity of the data matched, because the likelihood that the exact same HTML structure would exist with the wrong data in it is low. For example, I would parse a segment like this:
<tr class="dataRow" height="15px">
<td class="dataCell">5/21/2001</td>
<td class="dataCell">Active</td>
<td class="dataCell">75.6</td>
<td class="dataCell">CLASS B</td>
<td class="dataCell">foo</td>
<td class="dataCell">bar</td>
<td class="dataCell">
baz
</td>
</tr>
The usual case was having to parse every row of a data table, so I matched every row and every value in each cell. Once the match is done, everything is neatly dumped into an associative array.
The gotcha here is matching not only the spaces between tags, but also the carriage return and newline characters. To do this use the expression [\s\r\n]*. Since \s in PCRE already matches carriage returns and newlines, the shorter \s* is equivalent.
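The row-matching approach could be sketched roughly as follows. This is an illustration under assumptions, not the original parser: the column names in $columns are hypothetical (the source does not say what each cell means), and the sample row is joined with \r\n to exercise the whitespace handling.

```php
<?php
// The table row from the text, joined with Windows-style line endings.
$rowHtml = implode("\r\n", [
    '<tr class="dataRow" height="15px">',
    '<td class="dataCell">5/21/2001</td>',
    '<td class="dataCell">Active</td>',
    '<td class="dataCell">75.6</td>',
    '<td class="dataCell">CLASS B</td>',
    '<td class="dataCell">foo</td>',
    '<td class="dataCell">bar</td>',
    '<td class="dataCell">',
    'baz',
    '</td>',
    '</tr>',
]);

// Hypothetical column names, in the order the cells appear.
$columns = ['DATE', 'STATUS', 'SCORE', 'CATEGORY', 'FIELD5', 'FIELD6', 'FIELD7'];

// One named group per cell; \s* absorbs spaces, \r and \n around values.
$cells = '';
foreach ($columns as $name) {
    $cells .= '<td[^>]*>\s*(?P<' . $name . '>[^<]*?)\s*<\/td>\s*';
}
$pattern = '/<tr[^>]*>\s*' . $cells . '<\/tr>/';

$rows = [];
if (preg_match_all($pattern, $rowHtml, $result, PREG_SET_ORDER)) {
    foreach ($result as $match) {
        // Keep only the named keys; the numeric duplicates are dropped.
        $rows[] = array_intersect_key($match, array_flip($columns));
    }
}

print_r($rows);
```

Because the pattern requires exactly seven cells in this order, a row whose structure changes simply stops matching instead of silently yielding shifted values.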