IDCT Bartosz Pachołek


Parsing huge XML files with PHP

Published 2020-01-20

For quite a long time I worked in a company where one of my tasks was to parse often enormous data files: usually XMLs, but also totally custom text files. Let us avoid the topic of custom files and stick to XMLs for now. The environment we developed in was the one we all love and hate: PHP. I know at least a few people who would say it is probably not the best for data processing, yet it did its job. As mentioned, the files were often huge: over 4GB, and we tried to stick to a memory regime of 1.5GiB per process, as multiple processes were running at the same time. This required finding the most efficient ways to utilise the memory in a smart way, in order to obtain the parsed data in the shortest possible time.

PHP ways

PHP offers many ways of handling XML files: SimpleXML, XMLReader, mixed XMLReader + SimpleXML, the SAX parser, DOM, XMLReader + DOM (and even more, like SDO, but we shall skip that one as its intention was different). Each of those has its pros and cons. In this article I will show you the benefits and problems of using each of them, along with a method which is the result of my experiments and which I have found to be the most efficient for my needs.

Input data

In my work I operated on files which contained details about particular objects: lodgings, hotels, rooms, vacation houses etc. Additionally, I often received separate files with information about vacancies and prices. For demonstration purposes let us generate a file with random vacancy info for the next days for N lodgings in a very simple XML structure:

<?php
//filename: generate.php

//generates a random XML with vacancies info on particular date for hotel rooms
$rooms = 100000;
$datesPerRoom = 1000;

$pregeneratedDates = [];
$date = new DateTime();
for ($i = 0; $i < $datesPerRoom ; $i++) {
    $date->modify('+1 day');
    $pregeneratedDates[] = $date->format('Y-m-d');
}

echo "<hotel>\n";
echo "\t<rooms>\n";
for ($i = 0; $i < $rooms; $i++){
    echo "\t\t<room id=\"$i\">\n";
    foreach ($pregeneratedDates as $date) {
        echo "\t\t\t<vacancy date=\"$date\">" . mt_rand(0, 1) . "</vacancy>\n";
    }
    echo "\t\t</room>\n";
}
echo "\t</rooms>\n";
echo "</hotel>\n";

Execute this code as follows:

php generate.php > random.xml

This should output a ~4GiB XML file which we are going to use for our testing purposes.

Measurement

We are going to measure memory consumption and execution time. The code should, at least for a moment, store three variables for EACH vacancy: $objectId, $date and $value, so we are going to test only the reading capabilities.

Time is easy to measure: just compare the timestamps between finish and start. Measurement of memory usage is more complicated: we cannot simply rely on the values from memory_get_peak_usage(true); as it does not include resources and would therefore not be accurate for, for example, SimpleXML. Instead we are going to extract the data from our operating system's monitoring tools. To do so we are going to use the code below:

<?php
//filename: memcheck.php
/**
 * Gets peak memory usage of a process in KiB from /proc.../status.
 *
 * @return int|bool VmPeak, value in KiB. False if data could not be found.
 */
function processPeakMemUsage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status'); 
    $matches = array();
    preg_match_all('/^(VmPeak):\s*([0-9]+).*$/im', $status, $matches);  
    return !isset($matches[2][0]) ? false : intval($matches[2][0]);
}

The code above will work on any Debian-based OS, like Ubuntu for example (it does not work on Windows with Debian installed using WSL 1; hopefully WSL 2, which is coming soon, will support this).

I will perform the measurement on a machine with 16GiB (over 8GiB always free) of RAM.

First, I measured how much memory just the execution of PHP takes without any actual logic: it is 63MiB in my configuration, so that will be our tare weight.
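The tare measurement itself can be done with a script that includes only the helper and does nothing else (the filename below is my own, hypothetical choice):

```php
<?php
//filename: tare.php (hypothetical name, not part of the original setup)
include "memcheck.php";

//no actual logic: whatever VmPeak reports here is the cost of PHP itself
var_dump("Mem in MiB: " . round(processPeakMemUsage() / 1024));
```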

SimpleXML

SimpleXML is known for being very fast and very developer-friendly: the prefix Simple is there for a reason. Yet it has its limitations: because it loads the WHOLE provided XML and exposes it as an object, it requires a lot of memory. It is more than likely that I will not even be able to load the 4GB file into memory using just SimpleXML, but let us try...

<?php 
//filename: run.php
include "memcheck.php";
$start = time();
$xmlObject = simplexml_load_file('random.xml');

Execution:

php run.php

Sadly, after a few seconds, when there was no free memory left: Segmentation fault. Technically, maybe, if we could use more memory then SimpleXML would still be the fastest... but since it could not fit in 8GB while we have a regime of 1.5GB, we can disqualify it at this point. Yet do not lose hope in this interface: SimpleXML will be more than useful later.

DOM

Let us check if the direct approach with DOM will even manage to load the huge XML file, or fail similarly to SimpleXML.

<?php
include "memcheck.php";
$start = time();

$doc = new DOMDocument();
$doc->load('random.xml');

It took much longer, balanced around 8GiB for quite a while... and finally failed the same way as SimpleXML. Yet we shall elaborate a bit more on Document Object Model usage later in the article.

SAX Expat Parser

SAX is quite an antique parser, not based on libxml2, but it has some interesting concepts, like parsing in chunks. This parser is mostly forgotten by younger developers, but it actually grants out of the box many features which are present neither in SimpleXML nor in XMLReader. The main problem with it is the need to remember states: operating on the data like a state machine - this will be visible in the sample code. It is the first approach which actually guarantees that it will manage to load the file, so we need to write the actual testing code.

<?php
include "memcheck.php";
$start = time();

//we open a handle to the file
$stream = fopen('random.xml', 'r');

//create the actual parser
$parser = xml_parser_create();

/* like mentioned: we operate from top to bottom and consider 
each incoming data as a potential state changer, therefore we 
need some variables to store the actual states */

//will be set when we hit `room` starting tag from attributes
$lastObjectId = null; 

//will be set when SAX reads the contents between the nodes
$lastContents = null; 

//will be set when we hit the `vacancy` start tag
$lastDate = null; 

/*
 ! end tag of vacancy is the moment when we have all the required data for 
 each entry in such configuration
*/

/* like mentioned: SAX Parser works in quite an unusual way as its origin 
dates to PHP4. We need to pass methods (as in methods' names) which will 
handle occurences of: start tag, end tag and literal contents (texts 
between start and end tags) */
xml_set_element_handler($parser, "startTag", "endTag");
xml_set_character_data_handler($parser, "contents");

//now the nice part of sax parser: we load the chunks of data into the parser and it does the trick of joining and parsing whenever needed
while (($data = fread($stream, 16384))) {    
    xml_parse($parser, $data); // parse the current chunk
}
xml_parse($parser, '', true); // finalize parsing
//free any memory used by the parser
xml_parser_free($parser);
//close the stream
fclose($stream);

/**
 * function which handles the start tag and therefore 
 * can access the xml attributes
 * 
 * in our case we want object id and date from attributes
 */
function startTag($parser, $name, $attrs) {   
    /* in this prototype we use globals,
    but it is possible to use object context
    by informing SAX that it reside in an 
    entity with `xml_set_object` */
    global $lastDate, $lastObjectId; 
    switch($name) {
        case 'ROOM':        
            $lastObjectId = $attrs['ID'];
        break;    
        case 'VACANCY':
            $lastDate = $attrs['DATE'];
        break;
    }    
}

/** 
 * function which is executed when an end tag is hit
 * in our case the moment when we have all data of 
 * a particular vacancy (as required: object id, vacancy
 * date) is the moment when we read the contents of the
 * vacancy node, but for a bit of clarity in the code
 * we shall actually handle it when we hit the end tag 
 */
function endTag($parser, $name) {
    global $lastObjectId, $lastContents, $lastDate;
    switch ($name) {
        case 'ROOM':
            //here we would finalise room's data...
        break;
        case 'VACANCY':
            $objectId = $lastObjectId;
            $date = $lastDate;
            $value = $lastContents;
            //here we would process the data
        break;
    }    
}

/** 
 * function which handles the literal (string) contents of
 * a node
 */
function contents($parser, $data) {
    global $lastContents;
    $lastContents = $data;
}

var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

Execution with a buffer of 16384 bytes outputs:

string(14) "Mem in MiB: 63"
string(21) "Time in seconds:  140"

That is pretty amazing already: almost no memory used (be sure to remember the "tare weight" mentioned above, also 63MiB). So let us try with a larger buffer size...

With 64MiB (67 108 864 bytes):

string(15) "Mem in MiB: 257"
string(19) "Time in seconds:  4"

The result is even more amazing...

So I have tried to increase the memory even more:

  • 256 MiB

    string(15) "Mem in MiB: 833"
    string(20) "Time in seconds:  11"
  • 512 MiB

    string(16) "Mem in MiB: 1601"
    string(20) "Time in seconds:  20"
  • 32 MiB

    string(15) "Mem in MiB: 113"
    string(19) "Time in seconds:  3"
  • Whole file loaded using file_get_contents

    string(16) "Mem in MiB: 4073"
    string(19) "Time in seconds:  3"

There seems to be no gain in constantly increasing the buffer size, apart from the case of loading the whole file at once; in fact, the results were worse than with 32MiB. Why? Sadly I have no definitive answer here, yet there are several potential explanations. SAX cannot expect all nodes to finish within the same buffer chunk, so it most likely scans the buffered "area" a few times and creates lookup tables; with a larger buffer size these scans and lookup tables grow together with the size of the area. On the other hand, a buffer that is too small can heavily increase the time due to the amount of small scans over the provided text. Therefore, with SAX it is probably a good approach to spend a bit of time finding an optimal buffer size for the particular environment and the particular problem. A fact worth noting is that it actually managed to load and parse the whole file at once, and with great results!

As you can see, after getting through its somewhat unfriendly setup, the SAX parser can be a powerful tool. Still, the SAX parser is nothing more than a smarter text reader which does not provide any smarter XML handling than placing attributes into an array.

XMLReader

The crown jewel of PHP, the idol of teenagers, the solution to all your problems... That is basically how the usage of XMLReader is often advertised (well, I may have exaggerated a bit), yet is it really? It combines the best of libxml2 and reading streams (like in SAX), but at the same time, in at least a few places, the corners could have been polished a bit better. Let us get into the details...

XMLReader allows you to load data in two ways:

  • By pointing to a url or file:

    $reader = XMLReader::open('/opt/files/myfile.xml');
    //...
    $reader = XMLReader::open('https://idct.pl/some_url_loader.php');

    You can even post data with the request by setting the stream context using libxml_set_streams_context.

  • By providing complete xml as string:

    $reader = new XMLReader();
    $reader->XML(/* ...xml contents ... */);
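As a side note on the streams context mentioned in the first option: a sketch of how one could POST data before opening a URL (the URL and form fields below are made up for illustration):

```php
<?php
//sketch: POST a request body before XMLReader::open() on a URL
//(the URL and the form fields are hypothetical)
$context = stream_context_create([
    'http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query(['type' => 'vacancies']),
    ],
]);

//libxml-based openers, XMLReader::open() included, will use this context
libxml_set_streams_context($context);
$reader = XMLReader::open('https://idct.pl/some_url_loader.php');
```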

The first approach is good for huge files as it does not load the whole file into memory, but it has one huge issue: you cannot explicitly set the buffer size, as it does not allow passing streams, only paths, so you cannot tune anything with stream_set_read_buffer.

XMLReader grants us some additional features, like an easier option to jump between elements (the next function), but as you will see, in a very basic scenario the operation does not differ much from the SAX parser:

<?php
include "memcheck.php";
$start = time();

$xml = XMLReader::open('random.xml');
$lastObjectId = null;

while($xml->read()) {
    if ($xml->nodeType === \XmlReader::ELEMENT) {
        switch($xml->depth) {
            case 2: //we are in `room`
                //get the object id 
                $lastObjectId = $xml->getAttribute('id');
            break;
            case 3: //we are in `vacancy`
                $date = $xml->getAttribute('date');
                $objectId = $lastObjectId;                
                //now we need to jump one more time to the actual value...
                if (!$xml->isEmptyElement) {
                    $xml->read();
                }
                $value = $xml->value;
            break;
        }
    }
}

var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

Well, sorry, I was wrong: the actual code is much cleaner. In cases when operating just on single nodes, without any actual objects or information about parent/child data relations required, the implementation results in easier to maintain code.

What is its performance?

root@test:~# php xmlreader.php 
string(14) "Mem in MiB: 63"
string(21) "Time in seconds:  100"

Pretty great: better than SAX with the 16KiB buffer, and it used no additional memory (remember the tare weight of 63MiB). But hey: I have memory to use! I have 1.5GiB of my regime limit to fill, so is there some way of making that memory useful? And that is the major problem with XMLReader... there is none. The only real case for using these additional resources with XMLReader alone would be to create an instance of XMLReader for the data parsed by the first XMLReader (for inner nodes). That is why we are going to test the mixed approach in the next examples, but first let us check how it handles loading the whole file at once (with file_get_contents):

Sadly, it could not do it, as the function failed:

PHP Warning:  XMLReader::XML(): Unable to load source data in /root/xmlreader2.php on line 8

and libxml_get_errors returned nothing.
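The attempt itself (xmlreader2.php is not shown above, so this is my reconstruction of roughly what it did):

```php
<?php
//filename: xmlreader2.php (reconstruction, not the original file)
libxml_use_internal_errors(true);

$reader = new XMLReader();
//hand the entire ~4GiB file to XMLReader as one string
if ($reader->XML(file_get_contents('random.xml')) === false) {
    var_dump(libxml_get_errors()); //returned an empty array in my case
}
```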

Most likely a problem of libxml2 itself, as execution of the command:

root@test:~# xmllint random.xml           
Killed

fails too.

XMLReader + SimpleXML

We are going to quickly skip through the room elements, but convert the actual contents (vacancies) using SimpleXML. Such an approach would be especially useful if we had multiple different attributes (like name, address, contact details, description, etc.) inside each node, as it would give us a nice, object-based interface to the data: but more about that later.

The test code:

<?php
include "memcheck.php";
$start = time();

$xml = XMLReader::open('random.xml');

while($xml->read() && $xml->name !== 'room') {} //jump to first `room` element;

do {
    $node = simplexml_load_string($xml->readOuterXml());
    $objectId = (string) $node['id'];
    foreach($node->vacancy as $vacancyNode) {
        $value = (string) $vacancyNode;
        $date = (string) $vacancyNode['date'];
    }
} while ($xml->next('room'));

var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

Result:

string(14) "Mem in MiB: 65"
string(21) "Time in seconds:  291"

Memory usage fits in 2MiB over the tare, but the time went up heavily: in a simple example like this it can and probably should be disqualifying, but we shall elaborate more on SimpleXML usage later, on a more complex example where it can show its power, especially in usability and code maintainability.

Again, we cannot really do much here to use more memory and gain speed: the only option would be to split the XML into parts of a few rooms each and then parse them combined together under a fake root node with SimpleXML.
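For completeness, such a batching trick could look roughly like this (a sketch; the batch size is a made-up value to be tuned against the memory regime):

```php
<?php
//sketch: read `room` fragments with XMLReader, wrap a batch of them
//in a fake root node and parse the whole batch with SimpleXML at once
$xml = XMLReader::open('random.xml');
while ($xml->read() && $xml->name !== 'room') {} //jump to first `room`

$batch = [];
$batchSize = 500; //hypothetical value, tune for your environment

$processBatch = function (array $batch) {
    $rooms = simplexml_load_string('<fake>' . implode('', $batch) . '</fake>');
    foreach ($rooms->room as $room) {
        //...process each room as in the previous example...
    }
};

do {
    $batch[] = $xml->readOuterXml();
    if (count($batch) === $batchSize) {
        $processBatch($batch);
        $batch = [];
    }
} while ($xml->next('room'));

if ($batch) { //do not forget the last, incomplete batch
    $processBatch($batch);
}
```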

XMLReader + DOM

XMLReader comes with the method expand, which returns the current node as an instance of DOMNode. Apart from the fact that it is a much nicer, object-oriented way of handling XML nodes, it also supports broken XML and is an actual implementation of the W3C DOM API. The biggest pros and cons are often just opinions when it comes to judging the quality of that interface: for example, you will find people who find the model of multiple classes useful and some who find it annoying. Let us simply test it on our simple test case:

<?php
include "memcheck.php";
$start = time();

$xml = XMLReader::open('random.xml');

while($xml->read() && $xml->name !== 'room') {} //jump to first `room` element;

do {
    $node = $xml->expand();
    $objectId = $node->getAttribute('id');
    foreach($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) { //text nodes are XML_TEXT_NODE, comments XML_COMMENT_NODE, etc.
            $value = $child->textContent;
            $date = $child->getAttribute('date');
        }
    }
} while ($xml->next('room'));

var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

Results:

string(14) "Mem in MiB: 65"
string(21) "Time in seconds:  194"

First conclusion

Okay, so it would be easy to say that XMLReader or SAX are the winners: low memory usage, very fast operation. Well, that is both right and wrong: let us get just a bit more complex and see what happens, especially to the quality of our code.

Second input file

Apart from vacancies data, prices, etc., we usually had to parse the actual objects' information, which involves much more, often more dynamic content. To simulate that, let us generate a sample file with some random objects using this schema:

<objects>
    <object>
        <id>123</id>
        <name>random object name 123123</name>
        <services>
            <service>
                <id>123</id>
                <name>service name</name>
            </service>            
            ...
        </services>
        <features>
            <feature>
                <id>123</id>
                <name>feature name</name>
            </feature>            
        </features>
    </object>
    ...
</objects>

and the code to generate:

<?php
//filename: generate2.php

//generates a random XML with objects, each with random services and features
$objects = 80000;

echo "<objects>\n";
for ($i = 0; $i < $objects; $i++){
    echo "\t<object>\n";
    echo "\t\t<id>$i</id>\n";
    echo "\t\t<name>random name ".mt_rand(0,100000)."</name>\n";
    $services = mt_rand(0,10);
    if ($services > 0) {
        echo "\t\t<services>\n";
        for($j = 0; $j < $services; $j++) {
            echo "\t\t\t<service>\n";
            echo "\t\t\t\t<id>$j".mt_rand(100,200)."</id>\n";
            echo "\t\t\t\t<name>name $j</name>\n";
            echo "\t\t\t</service>\n";
        }        
        echo "\t\t</services>\n";
    } else {
        echo "\t\t<services/>\n";
    }

    $features = mt_rand(0,10);
    if ($features > 0) {
        echo "\t\t<features>\n";
        for($j = 0; $j < $features; $j++) {
            echo "\t\t\t<feature>\n";
            echo "\t\t\t\t<id>$j".mt_rand(100,200)."</id>\n";
            echo "\t\t\t\t<name>name $j</name>\n";
            echo "\t\t\t</feature>\n";
        }        
        echo "\t\t</features>\n";
    } else {
        echo "\t\t<features/>\n";
    }
    echo "\t</object>\n";
}
echo "</objects>\n";

For 80k objects it generates a file of around 60MiB (run as before: php generate2.php > out.xml), so it should easily fit into memory using any reader, really. Now let us try to parse this file using only XMLReader, to see how complex the code gets (with SAX it would get EVEN MORE complex). Again, our target is to have a "moment" in the code when we have all the data about the object in variables.

A sample code using XMLReader which would parse such file could look like this:

<?php
include "memcheck.php";
$start = time();

$xml = XMLReader::open('out.xml');

$lastParent = null;
$currentObject = null;

while($xml->read()) {
    switch($xml->depth) {
        case 1: //we are in `object`       
            if ($xml->nodeType === \XmlReader::ELEMENT) {
                $currentObject = [
                    'services' => [],
                    'features' => []
                ];
            } elseif ($xml->nodeType === \XmlReader::END_ELEMENT) {
                //end of <object> tag
                //here we have all the values in variable `$currentObject`
            }                
        break;
        case 2: //now we are inside the object element
            //now the switching starts: as there are multiple nodes inside we need to actually verify their names
            if ($xml->nodeType === \XmlReader::ELEMENT) {
                switch($xml->name) {
                    case 'id':                                        
                    case 'name':
                        $nodeName = $xml->name;
                        $xml->read(); //go to value
                        $currentObject[$nodeName] = $xml->value;
                    break;
                    case 'features':
                    case 'services':
                        $lastParent = $xml->name;
                    break;  
                }
            }
        break;
        case 3: //we are in the third level: in features or services
            //now we are in `service` or `feature` tag
        break;
        case 4:
            //here we are inside a <service> or <feature> tag
            $id = null;
            $name = null;

            while ($xml->depth === 4) {
                if ($xml->nodeType === \XmlReader::ELEMENT) {
                    $lastName = $xml->name;
                    $xml->read();
                    switch($lastName) {
                        case 'id':
                            $id = $xml->value;
                        break;
                        case 'name':
                            $name = $xml->value;
                        break;
                    }
                }

                $xml->read();
            }      

            if ($id !== null && $name !== null) {
                $currentObject[$lastParent][strval($id)] = $name;
            }
        break;
    }
}

var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

Of course that is just one potential implementation; you could do it totally differently. I have based it on the depth attribute and references to the last (parent) tag. Execution of such a script is almost instant and takes close to no memory:

string(14) "Mem in MiB: 63"
string(19) "Time in seconds:  3"

But one thing you must admit: the code is not easy to read, and once you manage to understand it, it clearly indicates how the complexity would grow heavily with any growth in complexity of the data source file itself, especially if parsing of deeper levels were required.

Therefore, in such cases, for the sake of yourself and your fellow developers, it is at least worth considering a more modern XML reader like SimpleXML. A file like the one in the example could easily fit in memory, so maybe we should consider using only SimpleXML? Let us check:

<?php
include "memcheck.php";
$start = time();

$xml = simplexml_load_file('out.xml');

foreach($xml->object as $object) {
    $id = (string) $object->id;
    $name = (string) $object->name;
    $features = [];
    foreach($object->features->feature as $feature) {
        $features[(string)$feature->id] = (string) $feature->name;
    }

    $services = [];
    foreach($object->services->service as $service) {
        $services[(string)$service->id] = (string) $service->name;
    }

    //and here we have everything we wanted to acquire for an object...
}

var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

and execution:

string(16) "Mem in MiB: 1114"
string(19) "Time in seconds:  2"

The execution time is also almost instant; memory usage is extremely high compared to the pure XMLReader execution, but still within the memory regime. The code is much easier to understand, and its complexity would not actually grow with new nodes inside the object. So should I use SimpleXML alone in such cases? I would say: no. Now imagine that your content provider suddenly increased the amount of objects delivered in a single file: you could easily go beyond the limits of your resources.

How to overcome this? By again using the mixed approach: it makes the code just a little more complex, but keeps you secure in terms of resources usage, which should remain low.

In our example we shall simply go through the file using XMLReader and only load the actual objects' details into SimpleXML:

<?php
include "memcheck.php";
$start = time();

$xml = XMLReader::open('out.xml');
//go to the first 'object' element
while ($xml->name !== 'object') { 
    $xml->read(); 
}

do {
    $object = simplexml_load_string($xml->readOuterXml());
    $id = (string) $object->id;
    $name = (string) $object->name;
    $features = [];
    foreach($object->features->feature as $feature) {
        $features[(string)$feature->id] = (string) $feature->name;
    }

    $services = [];
    foreach($object->services->service as $service) {
        $services[(string)$service->id] = (string) $service->name;
    }

    //here again we have all data of an object
} while ($xml->next('object'));

var_dump("Mem in MiB: " . round((processPeakMemUsage() / 1024)));
var_dump("Time in seconds:  " . (time() - $start));

Execution:

string(14) "Mem in MiB: 63"
string(19) "Time in seconds:  6"

As you can see, the memory used is still close to nothing (remember our tare weight) as the inner XML is very simple, the execution time in this example is still very good (low), and the code complexity is a fair compromise between the pure XMLReader version and the SimpleXML one. Plus, in case you do not have any main nodes (like the object one in our example), complexity will grow only in a roughly linear way with new attributes (understood as nodes) inside the object nodes.

Conclusion

First of all, be sure to understand that all the samples above are just examples. In your everyday work you should be more explicit in terms of checking if a particular node exists, verifying types, and handling files properly (like closing them), and of course, where possible, extracting methods for common code parts.

So which solution is the best? As you can see, none of them is universal. It is a constant balance between resource usage, execution time and code complexity, and depending on the particular project it may be best to use a different approach. For instance, as you could read above, for files which describe multiple small values - really just small collections of key => value elements (like the vacancies scenario) - it is probably best to avoid loading them into memory, as that creates huge objects which eat a lot of resources, while XMLReader or SAX still provide fairly acceptable (low) code complexity. For files which deliver multiple (but in reasonable amounts) objects with a lot of attributes (like the objects scenario), it may be possible to operate only in memory, and therefore in fully object-oriented ways, but with a risk of overflowing the resources: which is finally solved by a compromise between all three mentioned factors, in the form of the mixed approach.