$ cat /post/code-snippet-using-php-domdocument-remove-nodes.md

Using PHP DomDocument to remove nodes

Historical post, kept for reference only. Written for PHP 5, which has been EOL since January 2019. The php5-tidy package is now php-tidy (or versioned, e.g. php8.3-tidy), and DOMDocument::loadHTML in modern PHP needs explicit libxml flags to suppress HTML5 warnings.

I needed to strip out some DOM nodes from an HTML file. I would have used sed, but some of the tags are multiline, and sed/regexes really don’t understand HTML/XML and get really confused if you’re using nested tags of the same type. In the end I decided to use PHP’s built-in DOMDocument functions. It is fairly strict and refuses to load if the HTML isn’t perfectly formed, so first I ran it through PHP’s tidy - this isn’t installed by default but you can add it with:

sudo apt-get install php5-tidy

So first fix the malformed HTML:

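<?php
// Read the raw, possibly malformed HTML.
$html = file_get_contents("myfile.html");

// Tidy options: indent the output, emit XHTML, and disable line wrapping.
$config = array(
    'indent'       => true,
    'output-xhtml' => true,
    'wrap'         => 0
);

$tidy = tidy_parse_string($html, $config, 'UTF8');
$tidy->cleanRepair();

// And then load it into DOMDocument:
$doc = new DOMDocument();
$doc->loadHTML(tidy_get_output($tidy));
?>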
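On a current PHP version the load step might look something like the sketch below. It is only an illustration under a few assumptions: PHP 8, no tidy pass at all (it relies on libxml's own error recovery instead), and made-up sample markup and variable names.

```php
<?php
// Hypothetical input: sloppy HTML with an HTML5 tag and an unclosed <p>.
$html = '<html><body><section><p>Hello<script>var a = 1;</script></section></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// LIBXML_NOERROR / LIBXML_NOWARNING ask libxml not to report recoverable problems.
$loaded = $doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// Discard anything that was collected and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$scriptCount = $doc->getElementsByTagName('script')->length;
echo $loaded ? "loaded, found $scriptCount script tag(s)\n" : "load failed\n";
```

Whether skipping tidy is acceptable depends on how broken the input is; for badly mangled files the tidy pass is probably still worth keeping.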

Then it’s just a matter of ripping out the tags you don’t want. Note how we’re iterating through the $nodes variable - it MUST be done this way if you’re planning on removing the nodes (as I am), because the list returned by getElementsByTagName is live: as nodes are removed from the document they also disappear from the collection. A foreach will do some odd stuff - probably terminate after the first node - and an index-based for-loop will have you missing every other node. Instead, just remove the first item until the list is empty:

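<?php
// getElementsByTagName() returns a live list: it shrinks as nodes are removed.
$nodes = $doc->getElementsByTagName("script");
while ($nodes->length > 0) {
    // Always take the first item; the next one shifts into its place.
    $node = $nodes->item(0);
    remove_node($node);
}

// Empty the node, then detach it from its parent.
function remove_node($node) {
    $pnode = $node->parentNode;
    remove_children($node);
    $pnode->removeChild($node);
}

// Recursively remove every descendant of $node.
function remove_children($node) {
    while ($node->firstChild) {
        remove_children($node->firstChild);
        $node->removeChild($node->firstChild);
    }
}
?>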
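
To illustrate the point about the collection shrinking, here is a small sketch that runs an index-based for-loop against the live list and compares it with looping over a copy. The markup, the `load_sample` helper and the variable names are all hypothetical, and it assumes PHP 8. Copying the list with `iterator_to_array` is just one alternative to the while loop, not what the code above does.

```php
<?php
// Hypothetical sample document containing four <script> tags.
$html = '<html><body>'
      . '<script>a()</script><p>keep</p><script>b()</script>'
      . '<script>c()</script><script>d()</script>'
      . '</body></html>';

// Illustrative helper: parse the sample into a fresh DOMDocument.
function load_sample(string $html): DOMDocument {
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
    return $doc;
}

// 1. Index-based for-loop over the live list: each removal shifts the
//    remaining items down, so the loop steps over every other node.
$doc = load_sample($html);
$nodes = $doc->getElementsByTagName('script');
for ($i = 0; $i < $nodes->length; $i++) {
    $node = $nodes->item($i);
    $node->parentNode->removeChild($node);
}
$leftAfterFor = $doc->getElementsByTagName('script')->length;

// 2. Copy the live list into a plain array first; the array does not
//    change while nodes are removed, so a foreach visits every node.
$doc = load_sample($html);
$snapshot = iterator_to_array($doc->getElementsByTagName('script'));
foreach ($snapshot as $node) {
    $node->parentNode->removeChild($node);
}
$leftAfterSnapshot = $doc->getElementsByTagName('script')->length;

echo "for-loop left $leftAfterFor, snapshot left $leftAfterSnapshot\n";
```

The for-loop leaves two of the four script tags behind, while the snapshot version removes them all and leaves the paragraph untouched.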
$ env | grep ^CAT
CAT_01=dev
CAT_02=scripting
$ env | grep ^TAG
TAG_01=php
TAG_02=php-tidy
TAG_03=domdocument
TAG_04=dom
TAG_05=html
TAG_06=xml