Extracting HTML Tags and Their Attributes With PHP

There are several ways to extract specific tags from an HTML document. The one most people think of first is probably regular expressions. However, this is not always the best approach, or, as some would insist, it is never the best approach. Regular expressions can be handy for small hacks, but using a real HTML parser will usually lead to simpler and more robust code. Complex queries, like “find all rows with the class .foo in the second table of this document and return all links contained in those rows”, are also much easier to write with a decent parser.

There are some (though very few) edge cases where regular expressions might work better, so I will discuss both approaches in this post.

Extracting Tags With DOM

PHP 5 comes with a built-in DOM API that you can use to parse and manipulate (X)HTML documents. For example, here's how you could use it to extract all link URLs from an HTML file:

//Load the HTML page
$html = file_get_contents('page.htm');
//Create a new DOM document
$dom = new DOMDocument;
 
//Parse the HTML. The @ suppresses the warnings that are
//emitted when the $html string isn't well-formed markup.
@$dom->loadHTML($html);
 
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
 
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
    //Extract and show the "href" attribute. 
    echo $link->getAttribute('href'), '<br>';
}
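
If you need every attribute of a tag rather than just one, each element exposes them through its attributes property. Here is a minimal sketch of that idea; the sample markup and the $result variable are made up for illustration and are not part of the example above.

```php
<?php
// Hypothetical sample markup, used so the sketch runs without an external file.
$html = '<p><a href="/home" title="Home" class="nav">Home</a> <a href="/about">About</a></p>';

$dom = new DOMDocument;
// As before, @ hides the warnings emitted for imperfect markup.
@$dom->loadHTML($html);

// Collect one name => value array per link.
$result = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    $attributes = array();
    // $link->attributes is a DOMNamedNodeMap of DOMAttr nodes.
    foreach ($link->attributes as $attr) {
        $attributes[$attr->name] = $attr->value;
    }
    $result[] = $attributes;
}

print_r($result);
```

The same loop works for any tag name you pass to getElementsByTagName().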

Besides getElementsByTagName(), you can also use $dom->getElementById() to find the tag with a specific id. For more complex tasks, like extracting deeply nested tags, XPath is probably the way to go. For example, to find all links with the class “bar” inside list items whose class contains “foo”, and display their URLs:

//Load the HTML page
$html = file_get_contents('page.htm');
//Parse it. loadHTML() is an instance method, so create the
//DOMDocument first; calling it statically fails in PHP 8.
$dom = new DOMDocument;
@$dom->loadHTML($html);
 
//Init the XPath object
$xpath = new DOMXpath($dom);
 
//Query the DOM
$links = $xpath->query( '//li[contains(@class, "foo")]//a[@class = "bar"]' );
 
//Display the results as in the previous example
foreach($links as $link){
    echo $link->getAttribute('href'), '<br>';
}
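
The query from the introduction (links inside rows with the class “foo” in the second table of the document) could be written along these lines. Treat it as a sketch: the sample markup is invented, and the concat()/normalize-space() idiom is just one common way to match a whole class name in XPath 1.0, which is stricter than a plain contains().

```php
<?php
// Hypothetical sample document with two tables.
$html = '
<table><tr class="foo"><td><a href="/skip">first table</a></td></tr></table>
<table>
  <tr class="foo"><td><a href="/one">one</a></td></tr>
  <tr class="bar"><td><a href="/two">two</a></td></tr>
  <tr class="odd foo"><td><a href="/three">three</a></td></tr>
</table>';

$dom = new DOMDocument;
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// (//table)[2] picks the second table in document order.
// The concat() trick matches "foo" as a whole word in the class list,
// so a class like "foobar" would not match.
$query = '(//table)[2]//tr[contains(concat(" ", normalize-space(@class), " "), " foo ")]//a';

$urls = array();
foreach ($xpath->query($query) as $link) {
    $urls[] = $link->getAttribute('href');
}

print_r($urls);
```

Using //tr rather than /tr means the query finds the rows whether or not they are wrapped in a tbody element.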

Simple HTML DOM Parser is a popular alternative HTML parser for PHP 5 that lets you manipulate HTML pages with jQuery-like ease. However, I personally wouldn't recommend using it if you care about your script's performance: in my tests, Simple HTML DOM was about 30 times slower than DOMDocument.

Extracting Tags and Attributes With Regular Expressions

While most parsers require PHP 5 or later, regular expressions are available almost everywhere. They are also slightly faster than real parsers when you need to extract something from a very large document (around 400 KB or more). Still, in most cases you're better off using the PHP DOM extension, or even Simple HTML DOM, than messing with convoluted regular expressions.
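
Speed differences like this depend heavily on the PHP version and on the document, so it's worth measuring with your own data. A rough timing sketch might look like this; the synthetic document and the link-counting pattern are assumptions made for the test, not a rigorous benchmark.

```php
<?php
// Build a synthetic document of roughly 400 KB (an arbitrary test input).
$row  = '<p>Some text <a href="/page">link</a></p>' . "\n";
$html = '<html><body>' . str_repeat($row, 10000) . '</body></html>';

// Count links with a simple regular expression.
$start = microtime(true);
$regex_count = preg_match_all('@<a\s[^>]*href\s*=@i', $html, $matches);
$regex_time = microtime(true) - $start;

// Count links with the DOM extension (parsing time included).
$start = microtime(true);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$dom_count = $dom->getElementsByTagName('a')->length;
$dom_time = microtime(true) - $start;

printf("Regex: %d links in %.4f s\n", $regex_count, $regex_time);
printf("DOM:   %d links in %.4f s\n", $dom_count, $dom_time);
```

Run it several times and with real pages before drawing conclusions, since a single measurement on generated markup says little.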

Here's a PHP function that can extract any HTML tags and their attributes from a given string:

/**
 * extract_tags()
 * Extract specific HTML tags and their attributes from a string.
 *
 * You can either specify one tag, an array of tag names, or a regular expression that matches the tag name(s).
 * If multiple tags are specified you must also set the $selfclosing parameter and it must be the same for
 * all specified tags (so you can't extract both normal and self-closing tags in one go).
 *
 * The function returns a numerically indexed array of extracted tags. Each entry is an associative array
 * with these keys :
 *  tag_name    - the name of the extracted tag, e.g. "a" or "img".
 *  offset      - the numeric offset of the first character of the tag within the HTML source.
 *  contents    - the inner HTML of the tag. This is always empty for self-closing tags.
 *  attributes  - a name -> value array of the tag's attributes, or an empty array if the tag has none.
 *  full_tag    - the entire matched tag, e.g. '<a href="http://example.com">example.com</a>'. This key
 *                will only be present if you set $return_the_entire_tag to true.
 *
 * @param string $html The HTML code to search for tags.
 * @param string|array $tag The tag(s) to extract.
 * @param bool $selfclosing Whether the tag is self-closing or not. Setting it to null will force the script to try and make an educated guess.
 * @param bool $return_the_entire_tag Return the entire matched tag in 'full_tag' key of the results array.
 * @param string $charset The character set of the HTML code. Defaults to ISO-8859-1.
 *
 * @return array An array of extracted tags, or an empty array if no matching tags were found.
 */
function extract_tags( $html, $tag, $selfclosing = null, $return_the_entire_tag = false, $charset = 'ISO-8859-1' ){

    if ( is_array($tag) ){
        $tag = implode('|', $tag);
    }

    //If the user didn't specify if $tag is a self-closing tag we try to auto-detect it
    //by checking against a list of known self-closing tags.
    $selfclosing_tags = array( 'area', 'base', 'basefont', 'br', 'hr', 'input', 'img', 'link', 'meta', 'col', 'param' );
    if ( is_null($selfclosing) ){
        $selfclosing = in_array( $tag, $selfclosing_tags );
    }

    //The regexp is different for normal and self-closing tags because I can't figure out
    //how to make a sufficiently robust unified one.
    if ( $selfclosing ){
        $tag_pattern =
            '@<(?P<tag>'.$tag.')           # <tag
            (?P<attributes>\s[^>]+)?       # attributes, if any
            \s*/?>                         # /> or just >, being lenient here
            @xsi';
    } else {
        $tag_pattern =
            '@<(?P<tag>'.$tag.')           # <tag
            (?P<attributes>\s[^>]+)?       # attributes, if any
            \s*>                           # >
            (?P<contents>.*?)              # tag contents
            </(?P=tag)>                    # the closing </tag>
            @xsi';
    }

    $attribute_pattern =
        '@
        (?P<name>\w+)                                         # attribute name
        \s*=\s*
        (
            (?P<quote>[\"\'])(?P<value_quoted>.*?)(?P=quote)  # a quoted value
            |                                                 # or
            (?P<value_unquoted>[^\s"\']+?)(?:\s+|$)           # an unquoted value (terminated by whitespace or EOF)
        )
        @xsi';

    //Find all tags
    if ( !preg_match_all($tag_pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE ) ){
        //Return an empty array if we didn't find anything
        return array();
    }

    $tags = array();
    foreach ($matches as $match){

        //Parse tag attributes, if any
        $attributes = array();
        if ( !empty($match['attributes'][0]) ){

            if ( preg_match_all( $attribute_pattern, $match['attributes'][0], $attribute_data, PREG_SET_ORDER ) ){
                //Turn the attribute data into a name->value array
                foreach($attribute_data as $attr){
                    if( !empty($attr['value_quoted']) ){
                        $value = $attr['value_quoted'];
                    } else if( !empty($attr['value_unquoted']) ){
                        $value = $attr['value_unquoted'];
                    } else {
                        $value = '';
                    }

                    //Passing the value through html_entity_decode is handy when you want
                    //to extract link URLs or something like that. You might want to remove
                    //or modify this call if it doesn't fit your situation.
                    $value = html_entity_decode( $value, ENT_QUOTES, $charset );

                    $attributes[$attr['name']] = $value;
                }
            }

        }

        $tag = array(
            'tag_name' => $match['tag'][0],
            'offset' => $match[0][1],
            'contents' => !empty($match['contents'])?$match['contents'][0]:'', //empty for self-closing tags
            'attributes' => $attributes,
        );
        if ( $return_the_entire_tag ){
            $tag['full_tag'] = $match[0][0];
        }

        $tags[] = $tag;
    }

    return $tags;
}

Usage Examples
Extract all heading tags:

$nodes = extract_tags( $html, 'h\d+', false );
foreach($nodes as $node){
    echo strip_tags($node['contents']) , '<br>';
}

Extract meta tags:

$nodes = extract_tags( $html, 'meta' );

Or extract the bold and italicised parts of the text:

$nodes = extract_tags( $html, array('b', 'strong', 'em', 'i') );
foreach($nodes as $node){
    echo strip_tags( $node['contents'] ), '<br>';
}