Removing Useless Nodes From the DOM

November 21, 2012

For the third article in this series on short-and-sweet functions, I’d like to show you a simple function that I find indispensable when working with the HTML DOM. The function is called clean(), and its purpose is to remove comments and whitespace-only text nodes.

The function takes a single element reference as its argument, and removes all those unwanted nodes from inside it. The function operates directly on the element in question, because objects in JavaScript are passed by reference – meaning that the function receives a reference to the original object, not a copy of it. Here’s the clean() function’s code:

function clean(node)
{
  for(var n = 0; n < node.childNodes.length; n ++)
  {
    var child = node.childNodes[n];
    if
    (
      child.nodeType === 8 
      || 
      (child.nodeType === 3 && !/\S/.test(child.nodeValue))
    )
    {
      node.removeChild(child);
      n --;
    }
    else if(child.nodeType === 1)
    {
      clean(child);
    }
  }
}

So to clean those unwanted nodes from inside the <body> element, you would simply do this:

clean(document.body);

Alternatively, to clean the entire document, you could do this:

clean(document);

Although the usual reference would be an Element node, it could also be another kind of element-containing node, such as a #document. The function is also not restricted to working with HTML, and can operate on any other kind of XML DOM.

Why Clean the DOM

When working with the DOM in JavaScript, we use standard properties like firstChild and nextSibling to get relative node references. Unfortunately, complications can arise when whitespace is present in the DOM, as shown in the following example.

<div>
  <h2>Shopping list</h2>
  <ul>
    <li>Washing-up liquid</li>
    <li>Zinc nails</li>
    <li>Hydrochloric acid</li>
  </ul>
</div>

For most modern browsers (apart from IE8 and earlier), the previous HTML code would result in the following DOM structure.

DIV
+ #text ("\n\t")
+ H2
| + #text ("Shopping list")
+ #text ("\n\t")
+ UL
| + #text ("\n\t\t")
| + LI
| | + #text ("Washing-up liquid")
| + #text ("\n\t\t")
| + LI
| | + #text ("Zinc nails")
| + #text ("\n\t\t")
| + LI
| | + #text ("Hydrochloric acid")
| + #text ("\n\t")
+ #text ("\n")

The line breaks and tabs inside that tree appear as whitespace #text nodes. So, for example, if we started with a reference to the <h2> element, then h2.nextSibling would not refer to the <ul> element. Instead, it would refer to the whitespace #text node (the line break and tab) that comes before it. Or, if we started with a reference to the <ul> element, then ul.firstChild would not be the first <li>, it would be the whitespace before it.

HTML comments are also nodes, and most browsers also preserve them in the DOM – as they should, because it’s not up to browsers to decide which nodes are important and which are not. But it’s very rare for scripts to actually want the data in comments. It’s far more likely that comments (and intervening whitespace) are unwanted “junk” nodes.

There are several ways of dealing with these nodes. One is to iterate past them:

var ul = h2.nextSibling;
// stop at the first element node, or when the siblings run out
while(ul && ul.nodeType !== 1)
{
  ul = ul.nextSibling;
}
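If that kind of loop is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: the name nextElement() is hypothetical (it is not part of the DOM), and it returns null when there is no later element sibling. Browsers that support the built-in nextElementSibling property provide the same result without a helper.

```javascript
// Hypothetical helper: returns the next sibling that is an element
// (nodeType 1), or null if there is no such sibling.
function nextElement(node)
{
  var sibling = node.nextSibling;
  while(sibling && sibling.nodeType !== 1)
  {
    sibling = sibling.nextSibling;
  }
  return sibling || null;
}
```

With a reference to the <h2> element from the earlier example, nextElement(h2) would return the <ul> element rather than the whitespace between them.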

The simplest, most practical approach is just to remove them. So that’s what the clean() function does – effectively normalizing the element’s subtree, to create a model that matches our practical use of it, and is the same between browsers.

Once the <div> element from the original example is cleaned, the h2.nextSibling and ul.firstChild references will point to the expected elements. The cleaned DOM is shown below.

DIV
+ H2
| + #text ("Shopping list")
+ UL
| + LI
| | + #text ("Washing-up liquid")
| + LI
| | + #text ("Zinc nails")
| + LI
| | + #text ("Hydrochloric acid")

How the Function Works

The clean() function is recursive – a function that calls itself. Recursion is a very powerful feature, and means that the function can clean a subtree of any size and depth. The key to the recursive behavior is the final branch of the if statement, which is repeated below.

else if(child.nodeType === 1)
{
  clean(child);
}

So, each child of the element that is itself an element is passed to clean(). Then, the element children of that child node are passed to clean(). This continues until all of the descendants have been cleaned.

Within each invocation of clean(), the function iterates through the element’s childNodes collection, removing any #comment nodes (which have a nodeType of 8), or #text nodes (with a nodeType of 3) whose value is nothing but whitespace. The regular expression is actually an inverse test, looking for nodes which don’t contain non-whitespace characters.
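To make the inverse test more concrete, here is a short sketch that applies the same regular expression to a few sample strings. The isJunkText() name is hypothetical and is used only for illustration; clean() performs the test inline.

```javascript
// Hypothetical helper wrapping the same test used inside clean():
// returns true when the string contains no non-whitespace character.
function isJunkText(value)
{
  return !/\S/.test(value);
}

isJunkText("\n\t");           // true  (only a line break and a tab)
isJunkText("");               // true  (empty string)
isJunkText(" Zinc nails\n");  // false (contains visible text)
isJunkText("\u00A0");         // true in ES5-compliant engines, where \s
                              // includes the non-breaking space
```

The last case is worth keeping in mind: in those engines, a #text node containing nothing but a non-breaking space counts as whitespace-only, so it would be removed too.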

The function doesn’t remove all whitespace, of course. Any whitespace that is part of a #text node which also contains non-whitespace text is preserved. So, the only #text nodes to be affected are those which contain nothing but whitespace.

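Note that the loop has to query childNodes.length on every iteration, rather than saving the length in advance (which is usually more efficient). We have to do this because we’re removing nodes as we go along, which changes the length of the collection. For the same reason, the counter n is decremented after each removal, so that the node which moves into the vacated position isn’t skipped.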
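One alternative is to walk the collection from the last child to the first, so that removing a node never changes the position of a node that is yet to be visited. The sketch below is a variation on clean(), not a replacement for it, and the name cleanBackwards() is hypothetical.

```javascript
// Variation on clean(): iterate from the last child to the first, so a
// removal only shifts nodes that have already been visited. The length
// is read once, and the counter needs no adjustment after a removal.
function cleanBackwards(node)
{
  for(var n = node.childNodes.length - 1; n >= 0; n --)
  {
    var child = node.childNodes[n];
    if
    (
      child.nodeType === 8
      ||
      (child.nodeType === 3 && !/\S/.test(child.nodeValue))
    )
    {
      node.removeChild(child);
    }
    else if(child.nodeType === 1)
    {
      cleanBackwards(child);
    }
  }
}
```

It would be called in the same way, for example cleanBackwards(document.body). Which version to use is largely a matter of taste; the forward loop visits the nodes in document order, which some people find easier to follow.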
