How to import big spreadsheets
My first approach (it didn't work)
The initial strategy involved PHPOffice/PHPSpreadsheet, a popular library for spreadsheet operations. However, this approach proved problematic for large files because of its high memory usage. While optimizations like setReadDataOnly(true) and row iterators helped somewhat, the process remained slow and memory-hungry.
Bypassing PHP entirely and using MySQL directly offered superior performance. However, the default MySQL configuration presents a security barrier through the secure_file_priv option, which restricts file imports to designated directories.
To enable this functionality, add the following to your server's my.cnf file:
# my.cnf
[mysqld]
secure_file_priv = '/path/to/directory/with/CSV_files'
An additional limitation: MySQL direct imports cannot enforce business rules or validate data before insertion—validation must occur afterward.
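Since LOAD DATA will happily insert malformed rows, one option is to pre-screen the file in the shell before handing it to MySQL. A minimal sketch with awk, assuming a ';' separator and that every row should have as many fields as the header (the demo data here is invented):

```shell
# Flag rows whose field count differs from the header's.
# Assumes ';' as separator; adjust -F for your file.
printf 'id;name;amount\n1;alice;10\n2;bob\n' > my-sample.csv   # demo data
awk -F';' 'NR == 1 { n = NF; next }
           NF != n { printf "line %d: %d fields (expected %d)\n", NR, NF, n }' my-sample.csv
```

Rows it reports can be fixed or dropped before the import, instead of hunting them down in the table afterward.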
MySQL can import CSV files
The LOAD DATA command enables MySQL to parse CSV files and populate tables directly:
LOAD DATA INFILE '/path/to/directory/with/CSV_files/my-big-spreadsheet.csv'
INTO TABLE my_table
(column1, column2, column3);
How to handle a big file?
Never attempt to open massive files with text editors. Instead, use command-line tools designed for this purpose.
Viewing file samples: The head and tail commands display opening and closing lines respectively. Use the -n flag to specify quantity: head -n 10 my-big-spreadsheet.csv.
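A quick demonstration of both commands, using a generated stand-in file since any real CSV works the same way:

```shell
# Peek at a huge file without opening it in an editor.
seq 1 100 > demo.csv   # stand-in for a big CSV
head -n 3 demo.csv     # prints the first three lines
tail -n 3 demo.csv     # prints the last three lines
```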
Splitting files: Divide CSV files into manageable chunks using the split command:
split -l 1000000 my-big-spreadsheet.csv my-big-spreadsheet-
This creates files with 1 million lines each.
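At demo scale (1000-line chunks instead of a million), the behavior is easy to verify: split writes chunks with alphabetic suffixes, and the last chunk holds the remainder:

```shell
# Split into 1000-line chunks named big-aa, big-ab, big-ac, ...
seq 1 2500 > big.csv
split -l 1000 big.csv big-
wc -l big-*            # two full 1000-line chunks plus a 500-line remainder
```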
Testing
Never test with the full dataset. Create a sample: head -n 3000 my-big-spreadsheet.csv > my-sample.csv
If you need to abort an import that is taking too long, find the query's id and kill it:
SHOW PROCESSLIST;
KILL QUERY <id>;
Skip the header row
Add IGNORE 1 LINES to your import statement:
LOAD DATA INFILE '/path/to/directory/with/CSV_files/my-big-spreadsheet.csv'
INTO TABLE my_table
IGNORE 1 LINES
(column1, column2, column3);
Line endings, column separator, and enclosing character
During one import operation, only half of the rows were loaded due to Windows line endings (\r\n) conflicting with the Linux server's expected format (\n).
Determine line ending types using cat -e:
head -n 10 my-big-spreadsheet.csv | cat -e
This reveals: ^M$ = Windows carriage return + newline, $ = Linux newline only.
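The detection, and one way to normalize the endings if you would rather convert the file than adjust the query (the file names are placeholders):

```shell
# Detect Windows line endings: cat -e shows \r as ^M and end-of-line as $.
printf 'one\r\ntwo\r\n' > crlf.csv
cat -e crlf.csv              # shows: one^M$ / two^M$
tr -d '\r' < crlf.csv > lf.csv   # strip the carriage returns
cat -e lf.csv                # shows: one$ / two$
```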
The complete query:
LOAD DATA INFILE '/path/to/directory/with/CSV_files/my-big-spreadsheet.csv'
INTO TABLE my_table
COLUMNS TERMINATED BY ';'
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(column1, column2, column3);
Counting rows
Verify imported row counts using wc -l my-big-spreadsheet.csv.
For tables with an auto-increment primary key that started empty, SELECT MAX(id) FROM my_table executes much faster than COUNT(*) and yields the same number, as long as no rows were deleted.
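Remember that wc -l counts the header line too, so subtract one before comparing against the table. A quick sketch with a stand-in file:

```shell
# Expected row count = file lines minus the header line.
printf 'id;name\n1;alice\n2;bob\n' > demo.csv
expected=$(( $(wc -l < demo.csv) - 1 ))
echo "$expected"   # number of data rows to expect in the table
```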
Bonus tip: is the table empty?
SELECT * FROM my_table LIMIT 1;
An empty result confirms the table is ready for import.
Originally published on norday.tech.