PHP + MySQL: Bulk import, best methods

Published 2020-01-21

Data processing often ends with storing the data in some kind of database, and often that task is given to a MySQL-compatible server, for example MariaDB or MySQL Community Server. Our favorite environment, which is PHP, offers numerous ways of connecting to such servers: the MySQLi client, PDO, and recently more often DBAL from the Doctrine ecosystem. Depending on the particular environment one interface may be better or worse than another, but this article is not an exploration of libraries; instead we shall try to find the most efficient way of storing processed data in the database. For scenarios where you operate on small batches of data this may not be an important topic, as it is usually irrelevant whether an INSERT operation takes 80 or 95 milliseconds, yet when you need to insert millions of entries then numerous aspects need to be considered.

Input data

In order to find the best way of handling millions of inserts we need some test data; we shall take the vacancies XML from one of my previous articles: https://idct.pl/article/parsing-huge-xml-files-with-php

There you can find code to generate a file with objects' ids, dates and information about vacancy availability (0 or 1). For testing purposes I have reduced the numbers: 10000 objects and 300 dates per object, which results in 3,000,000 data records.

Output

To store input data we shall create a simple table with three columns which represent the imported data:

CREATE TABLE `article_test` (
  `object_id` int(8) NOT NULL,
  `date` datetime NOT NULL,
  `vacancy` tinyint(1) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

I have also added one index, on the object_id column:

ALTER TABLE `article_test`
  ADD KEY `object_id` (`object_id`) USING BTREE;
COMMIT;

as indexing impacts how the imported data is handled internally, especially in terms of execution time.

I used InnoDB for transaction support.

Connection

For our testing purposes we shall use PDO. I have created a user article_test who has access to the database article_test using the password... article_test. Therefore the connection line looks as follows:

$db = new PDO('mysql:host=localhost;dbname=article_test',
    'article_test',
    'article_test');
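
One optional addition, not used in the tests in this article, is switching PDO to exception-based error reporting, so that a failed INSERT aborts the import instead of failing silently:

//optional: make PDO throw exceptions on errors (not part of the original tests)
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);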

Testing methodology + Trivial approach

First of all, we shall include the data parsing time in all of the tests, as in some scenarios we simply cannot skip it; the only fair option is therefore to include it always.

The simplest approach is just to create a prepared statement:

$stmt = $db->prepare('insert into article_test (object_id, date, vacancy) values (:object_id, :date, :vacancy)');

and to execute it, using execute(), inside the loop which reads the data we want to insert as database records.

In our scenario the requirement is to import all-or-nothing, therefore let's use the transaction mechanism:

$db->beginTransaction();
....
$db->commit();

To keep the results free from the impact of caching, and from slowdown caused by the growing amount of data, we shall clear the table before each test:

$db->exec('truncate article_test');

Each test shall be run three times and we shall take the best result, to avoid cases where the results were affected by the testing environment.
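
The scripts below measure time with time(), which has one-second granularity; that is enough at this scale, but should you ever need finer measurements, hrtime() (available since PHP 7.3) is a drop-in option:

$start = hrtime(true); //nanoseconds
//... work to be measured ...
$seconds = (hrtime(true) - $start) / 1e9;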

Code of the first test:

<?php
$xml = XMLReader::open('data.xml');
$lastObjectId = null;

$db = new PDO('mysql:host=localhost;dbname=article_test',
    'article_test',
    'article_test');

$db->exec('truncate article_test');

$start = time();
$stmt = $db->prepare('insert into article_test (object_id, date, vacancy) values (:object_id, :date, :vacancy)');
$db->beginTransaction();

$i = 0;
while($xml->read()) {
    if ($xml->nodeType === \XmlReader::ELEMENT) {
        switch($xml->depth) {
            case 2: //we are in `room`
                //get the object id 
                $lastObjectId = intval($xml->getAttribute('id'));
                ++$i;
                if ($i % 1000 == 0) {
                    echo "Objects progress: " . $i . "\n";
                }
            break;
            case 3: //we are in `vacancy`
                $date = $xml->getAttribute('date');
                $objectId = $lastObjectId;   
                //now we need to jump one more time to the actual value...
                if (!$xml->isEmptyElement) {
                    $xml->read();
                }
                $value = intval($xml->value);
                $stmt->bindParam(':object_id', $objectId, PDO::PARAM_INT); 
                $stmt->bindParam(':date', $date, PDO::PARAM_STR);
                $stmt->bindParam(':vacancy', $value, PDO::PARAM_BOOL);
                $stmt->execute();
            break;
        }
    }
}

$db->commit();
$time = time() - $start;
var_dump('everything took: ' . $time);

This approach uses the least memory on the application's side while pushing the burden of maintaining resources onto the DBMS.

Results:

  • 251s
  • 247s
  • 258s

Final result: 251s.

Since in the next tests we shall drop type information from the prepared statement, to be fair we should execute this test again without that functionality.

Warning: this approach behaves very differently with particular engines. With MyISAM, which does not support transactions, it would cause not only short locks on every call of execute() but also rebuilding of the index at the same time. Be aware that with MyISAM, to avoid any interference during the import process, you should lock the table and disable indexing until the process has finished. MyISAM will of course not grant you the rollback mechanism.
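
A minimal sketch of what that could look like with PDO, assuming a MyISAM variant of the same table (note that DISABLE KEYS affects only non-unique indexes):

$db->exec('LOCK TABLES article_test WRITE');
$db->exec('ALTER TABLE article_test DISABLE KEYS');

//... run the insert loop here ...

$db->exec('ALTER TABLE article_test ENABLE KEYS');
$db->exec('UNLOCK TABLES');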

Back to InnoDB and our test: without explicit type binding the code looks as follows:

<?php
$xml = XMLReader::open('data.xml');
$lastObjectId = null;

$db = new PDO('mysql:host=localhost;dbname=article_test',
    'article_test',
    'article_test');

$db->exec('truncate article_test');

$start = time();
$stmt = $db->prepare('insert into article_test (object_id, date, vacancy) values (?, ?, ?)');
$db->beginTransaction();

$i = 0;
while($xml->read()) {
    if ($xml->nodeType === \XmlReader::ELEMENT) {
        switch($xml->depth) {
            case 2: //we are in `room`
                //get the object id 
                $lastObjectId = intval($xml->getAttribute('id'));
                ++$i;
                if ($i % 1000 == 0) {
                    echo "Objects progress: " . $i . "\n";
                }
            break;
            case 3: //we are in `vacancy`
                $date = $xml->getAttribute('date');
                $objectId = $lastObjectId;   
                //now we need to jump one more time to the actual value...
                if (!$xml->isEmptyElement) {
                    $xml->read();
                }
                $value = intval($xml->value);
                $stmt->execute([ $objectId, $date, $value ]);
            break;
        }
    }
}

$db->commit();
$time = time() - $start;
var_dump('everything took: ' . $time);

Results:

  • 256s
  • 249s
  • 262s

Final result: 249s.

Therefore it looks like type information / verification does not actually impact performance.

Potential optimization

In case your tables rely on foreign key or unique index checks, you can disable these checks for the import, as InnoDB otherwise still checks them row-by-row, even inside a transaction. From http://dev.mysql.com/doc/refman/5.7/en/innodb-foreign-key-constraints.html:

InnoDB checks foreign key constraints immediately; the check is not deferred to transaction commit.

First of all it is good to disable AUTOCOMMIT, because it performs a log flush to disk for every insert: SET autocommit=0. If you do not use transactions then be sure to finish the import with a COMMIT statement and re-enable the mode (SET autocommit=1) before you run other queries on the same connection.

Especially if the data is already verified, you can disable foreign key and unique key checking: SET unique_checks=0; SET foreign_key_checks=0.

Be sure to re-enable them after the import: SET unique_checks=1; SET foreign_key_checks=1.
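
A minimal sketch of wrapping the import with these session-level settings through PDO (the insert loop itself is omitted):

$db->exec('SET autocommit=0');
$db->exec('SET unique_checks=0');
$db->exec('SET foreign_key_checks=0');

//... run the insert loop here ...

$db->exec('COMMIT'); //single log flush at the end
$db->exec('SET foreign_key_checks=1');
$db->exec('SET unique_checks=1');
$db->exec('SET autocommit=1');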

Yet none of this matters much in our test case, as we do not actively use any of these features. So how can we improve import times?

In MyISAM you can also disable/enable keys (indexing while adding):

$db->exec("ALTER TABLE `article_test` DISABLE KEYS;");
...
$db->exec("ALTER TABLE `article_test` ENABLE KEYS;");

Single query approach

With MySQL syntax we can create one big query for our import, yet this introduces a few new problems: first, we need to actually build the big string variable which holds the SQL; second, the amount of bound variables may affect the behavior of the MySQL driver (one potential solution is to drop bound variables and interpolate values in code, which is generally not suggested); third, there is the query size limit in the MySQL server, defined by the max_allowed_packet setting.
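
You can check the current limit up front; raising it is a server-side change, so treat the commented-out SET GLOBAL below as a sketch (it requires the appropriate privilege and applies to new connections):

//check how big a single query packet may be on this server
$row = $db->query("SHOW VARIABLES LIKE 'max_allowed_packet'")->fetch(PDO::FETCH_ASSOC);
echo 'max_allowed_packet: ' . $row['Value'] . " bytes\n";

//optionally raise it (needs SUPER or SYSTEM_VARIABLES_ADMIN):
//$db->exec('SET GLOBAL max_allowed_packet = 67108864'); //64 MiB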

Let's try this:

<?php
$xml = XMLReader::open('data.xml');
$lastObjectId = null;

$db = new PDO('mysql:host=localhost;dbname=article_test',
    'article_test',
    'article_test');

$db->exec('truncate article_test');

$start = time();

//$stmt = $db->prepare('insert into article_test (object_id, date, vacancy) values');
$query = 'insert into article_test (object_id, date, vacancy) values ';

//will store the collected values
$values = [];

$i = 0;
while($xml->read()) {
    if ($xml->nodeType === \XmlReader::ELEMENT) {
        switch($xml->depth) {
            case 2: //we are in `room`
                //get the object id 
                $lastObjectId = intval($xml->getAttribute('id'));
                ++$i;
                if ($i % 1000 == 0) {
                    echo "Objects progress: " . $i . "\n";
                }
            break;
            case 3: //we are in `vacancy`
                $date = $xml->getAttribute('date');
                $objectId = $lastObjectId;   
                //now we need to jump one more time to the actual value...
                if (!$xml->isEmptyElement) {
                    $xml->read();
                }
                $value = intval($xml->value);
                $values[] = $objectId;
                $values[] = $date;
                $values[] = $value;
            break;
        }
    }
}

//now we need to create a query which will have the same amount of bound variables
$variables = count($values) / 3; 
$query .= join(',', array_fill(0, $variables, '(?, ?, ?)'));

$db->beginTransaction();
$stmt = $db->prepare($query);
$stmt->execute($values);
$db->commit();
$time = time() - $start;
var_dump('everything took: ' . $time);

The first execution finished with an error for me:

PHP Warning:  Error while sending QUERY packet. PID=10673

and that is because my max_allowed_packet was set to 16 MiB. After increasing it to 64 MiB the results are:

  • 31s
  • 28s
  • 31s

Amazing, right? Well, yes and no: remember that you actually need to a) store the values in a big variable (or variables) before sending them to the DBMS, b) you are limited by max_allowed_packet, and c) you would need to incorporate chunking whenever the target DBMS has a fixed maximum allowed packet, as sketched below.
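
A minimal sketch of such chunking, assuming an arbitrary chunk size of 10000 rows (not a value measured in this article); the parsing loop stays the same, only the collection part changes:

//flush the multi-row insert every $rowsPerChunk rows to stay below max_allowed_packet
function flushChunk(PDO $db, array &$values): void
{
    if (empty($values)) {
        return;
    }
    $query = 'insert into article_test (object_id, date, vacancy) values '
           . join(',', array_fill(0, count($values) / 3, '(?, ?, ?)'));
    $db->prepare($query)->execute($values);
    $values = [];
}

$rowsPerChunk = 10000; //example value, tune against max_allowed_packet

//inside the parsing loop, instead of only collecting:
$values[] = $objectId;
$values[] = $date;
$values[] = $value;
if (count($values) >= 3 * $rowsPerChunk) {
    flushChunk($db, $values);
}

//after the loop, flush whatever is left:
flushChunk($db, $values);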

Is that the best solution?

That depends. Based on what the MySQL documentation says here: https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.html we should actually use the LOAD DATA statement, as it can be even up to 20x faster!

LOAD DATA with fputcsv

Basically, the LOAD DATA statement loads a prepared CSV file into the database. This makes it not worth the effort for a single or just a few inserts, but it shows its potential for larger ones; you will, however, need access to some temporary disk storage. Let us try it: first we need to create the data file using fputcsv:

<?php
$xml = XMLReader::open('data.xml');
$lastObjectId = null;

$db = new PDO('mysql:host=localhost;dbname=article_test',
    'article_test',
    'article_test');

$db->exec('truncate article_test');

$start = time();

$file = fopen('out.csv', 'w');

$i = 0;
while($xml->read()) {
    if ($xml->nodeType === \XmlReader::ELEMENT) {
        switch($xml->depth) {
            case 2: //we are in `room`
                //get the object id 
                $lastObjectId = intval($xml->getAttribute('id'));
                ++$i;
                if ($i % 1000 == 0) {
                    echo "Objects progress: " . $i . "\n";
                }
            break;
            case 3: //we are in `vacancy`
                $date = $xml->getAttribute('date');
                $objectId = $lastObjectId;   
                //now we need to jump one more time to the actual value...
                if (!$xml->isEmptyElement) {
                    $xml->read();
                }
                $value = intval($xml->value);
                fputcsv($file, [ $objectId, $date, $value ]);
            break;
        }
    }
}

fclose($file);

//now we need to create a query which will have the same amount of bound variables

$db->beginTransaction();

$statement = <<<HEREDATA
LOAD DATA LOW_PRIORITY INFILE "out.csv"
INTO TABLE article_test
CHARACTER SET utf8
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
HEREDATA;

$db->exec($statement);
$db->commit();
$time = time() - $start;
var_dump('everything took: ' . $time);

Results:

  • 39s
  • 38s
  • 34s

Longer than the single big query, why? The answer here is simple: hard disk and I/O usage. We actually need to save the data into a file, and the buffering capabilities of PHP I/O operations are very unreliable, as stream_set_write_buffer actually does nothing with local files (at least at the time of writing this article); you can try this:

$file = fopen('out.csv', 'w');
stream_set_write_buffer($file, 16384);

So where is the benefit over the single big query? You operate on a file, therefore your PHP app is less likely to hit any potential memory limits. What are the problems? The biggest ones are:

  • disk storage access required
  • special privileges: if the file is on the same machine as the server then your user needs to have the FILE privilege
  • can be slowed down by disk's performance

Are there any ways to resolve the problems with disk IO?

Yes.

LOAD DATA with fputcsv in memory

If you have root access to the system on which you store the file you can create a ramdisk:

mkdir /mnt/ramdisk
mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
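
and later store the actual file on that ramdisk. Assuming the example mount point above, the only change needed in the earlier script is the file path:

$file = fopen('/mnt/ramdisk/out.csv', 'w');
//...and the LOAD DATA statement has to point at the same location:
//LOAD DATA LOW_PRIORITY INFILE "/mnt/ramdisk/out.csv" ...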

This should help a bit, but is there any PHP-only solution? Yes, there is.

LOAD DATA with CSV file buffered

Simply put: instead of writing each row directly to disk, write it first to a memory stream and store it with a single operation at the end.

In order to do it first create the memory stream:

$file = fopen('php://memory', 'w');

Before you actually close it, flush the buffer and copy it over to a file:

$realfile = fopen('out.csv', 'w');
fflush($file);
rewind($file);
stream_copy_to_stream($file, $realfile);
fclose($realfile);
fclose($file);

Results:

  • 19s
  • 20s
  • 19s

Amazing, right? Of course this approach has the same flaw as the big query: it requires you to keep the file in memory until writing is finished. This can be especially problematic when you cannot fit everything in memory even for a moment: then you would need to incorporate flushing to a physical file during the import.

Is there any way we can further improve? Yes, we can (~ Barack Obama).

PHP DB CSV Writer component

You can use my PHP Db Csv Writer, which is based on another of my libs: a buffered CSV writer. It allows creating CSV files in memory while flushing parts to disk when the buffer limit gets hit. Available on GitHub: https://github.com/ideaconnect/php-db-csv-writer

To install it I suggest using Composer:

composer require idct/php-db-csv-writer

Include the autoloader and create an instance of the component:

include "vendor/autoload.php";
use IDCT\CsvWriter\DbCsvWriter;
$dbCsvWriter = new DbCsvWriter();

Now open a collection (basically a new csv file):

$dbCsvWriter->startCollection('out', ['object_id', 'date', 'vacancy']);

Pass the PDO object:

$dbCsvWriter->setPdo($db);

and optionally set the buffer size:

$dbCsvWriter->setBufferSize(33554432); //32 MiB

To write data into this collection call:

$dbCsvWriter->appendData([ ... data ... ]);

and to finally store it:

$dbCsvWriter->storeCollection($tableName);

For our test file the code will look as follows:

<?php
include "vendor/autoload.php";

$xml = XMLReader::open('data.xml');
$lastObjectId = null;

$db = new PDO('mysql:host=localhost;dbname=article_test',
    'article_test',
    'article_test');

$db->exec('truncate article_test');
$start = time();
use IDCT\CsvWriter\DbCsvWriter;
$dbCsvWriter = new DbCsvWriter();
$dbCsvWriter->startCollection('out', ['object_id', 'date', 'vacancy'])
            ->setPdo($db, false)
            ->setTmpDir('./')
            ->setBufferSize(1024*1024); //1 MiB

$i = 0;
while($xml->read()) {
    if ($xml->nodeType === \XmlReader::ELEMENT) {
        switch($xml->depth) {
            case 2: //we are in `room`
                //get the object id 
                $lastObjectId = intval($xml->getAttribute('id'));
                ++$i;
                if ($i % 1000 == 0) {
                    echo "Objects progress: " . $i . "\n";
                }
            break;
            case 3: //we are in `vacancy`
                $date = $xml->getAttribute('date');
                $objectId = $lastObjectId;   
                //now we need to jump one more time to the actual value...
                if (!$xml->isEmptyElement) {
                    $xml->read();
                }
                $value = intval($xml->value);
                $dbCsvWriter->appendData([ $objectId, $date, $value ]);
            break;
        }
    }
}

$db->beginTransaction();
$dbCsvWriter->storeCollection('article_test');
$db->commit();
$time = time() - $start;
var_dump('everything took: ' . $time);

Results:

  • 28s
  • 29s
  • 28s

This is with a small buffer of just 1 MiB.

So on par with the single big query, and faster than the plain file-based LOAD DATA, while keeping memory usage under control.

Conclusion

As you can see, for bulk imports choosing the right method can make a significant difference: we went from 249s down to 19s, with a realistic value of around 30s for the methods which allow you to actively manage memory usage. Behavior may vary a bit depending on the actual data and hardware (SSD vs HDD, type of memory), therefore apart from having general knowledge of this topic it is worth performing some profiling in your actual environment before deciding on the final implementation.