PHP file performance options

Submitted by peter on Thu, 07/26/2018 - 11:19

File performance is important because there is a heap of data in files, you find many occasions when files are a better choice than other storage options, and you may need to supply your data to other people as files. Let's look at file performance considerations and run tests.

Who?
When?
Where?
Why?
Way?
Worth?
What?

Who?

Who makes decisions about using files? Your suppliers might decide to use files and your only choice is how to access the files. Your customers may request files and you can choose only how you will process them. With PHP, there are multiple choices. We will look at some of those choices.

When?

When do you get choices about performance? When you have a choice of file formats, you can choose to process files into another format for repeated access. With PHP, you always have choices about the way you read files and the way you write files. For large files, PHP lets you choose between fast reads with massive memory usage or slower reads with small memory usage.

A common example of a file format change is the input of a catalogue/price list from a supplier. You select the data you need and place the data in a database table for frequent access when your customers are shopping. When the input contains images, you can choose to use separate files or aggregate all images for one product into one file. You can change format from something like PNG to JPEG.

Where?

Where can you gain the most performance improvements? Reading and writing offer the most options. You can choose to read/write whole files or segments. Reading a large file as one big chunk may save some processing time but will use a mass of memory, potentially slowing down processing by causing paging.

Writing is usually slower than reading. You could look at the writing code first to gain the biggest improvements.

For Web sites, the writing of a file like a product description might be once per day while the reading might be thousands of times per day. But the reading might be cached so the actual file reads drop down. You need to look at where the reads and writes occur in the overall processing.

Why?

Why are we testing the performance of files instead of databases? Many database packages can store huge blocks of data but they store the data as files instead of database rows. The database rows contain only the location of the file. You can save processing overhead by simply placing the location of the file in the database instead of shovelling the whole file into the database. That also saves database overheads every time you read the file as the database only has to return a string containing the path to the file.

When files are left in their original form, the operating system can automatically cache the file without database overheads, another reason to avoid files in databases. Plus the OS file cache is automatic, so there is no database tuning required.

Why are we testing with lines separated by "\n"? This is the most common situation and the one where you have the most choices. Other formats limit you to one choice and no control over performance.

Way?

How do you make the changes? In PHP, you can read a file in segments using fopen(), fread(), then fclose(). file_get_contents() is faster for small files because everything happens in one instruction. Do you really save significant time? This will be one of our tests.

Compare your processing for a small file against a file containing millions of rows of data. You might need gigabytes of memory to read the file then more gigabytes to process the data from the file. file_get_contents could paralyse your server. The alternative is to use fread to get a small segment of data, process that, then perform another fread. You then test different size segments in fread.
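The segment-by-segment read described above might look like the following minimal sketch. The function name readInSegments() and its callback parameter are hypothetical, chosen only for illustration; the segment size is the value you would vary in your own tests.

```php
<?php
// Read a file in fixed-size segments so memory use stays close to
// $chunkSize instead of growing to the size of the whole file.
// readInSegments() is an illustrative name, not a PHP built-in.
function readInSegments(string $path, int $chunkSize, callable $process): int
{
    $file = fopen($path, 'rb');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $total = 0;
    try {
        while (!feof($file)) {
            $segment = fread($file, $chunkSize);
            if ($segment === false || $segment === '') {
                break; // read error or nothing left to read
            }
            $total += strlen($segment);
            $process($segment); // handle this segment, then let it go
        }
    } finally {
        fclose($file);
    }
    return $total; // number of bytes read
}
```

You would call it with something like readInSegments($path, 65536, $callback) and then repeat the test with other segment sizes to compare times.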

Worth?

Is there a financial cost connected to all of this? The more you test, the more money is wasted on development. Usually some reading, pages like this one, and a small number of tests will give you the right start. You might then repeat some tests if there is a dramatic change in the size of your files.

Memory is cheap but a small increase during testing might indicate a massive requirement when you process files in volume. That extra memory usage can be expensive when you use something like the Amazon cloud.

Cloud costs are variable. Will you use a mass of memory all day or just a few minutes each week? What happens when your supplier starts sending big price files every hour instead of once per week?

What?

What are the choices? Files can be finished files or data streams. Files can contain lines or records or rows or documents with different structures. We will look at regular files. Only some of the file processing options can handle streams, so with streams you are often limited to fewer options. Whether you have documents, lines, rows, or records, there is a structure or a delimiter you have to know. Some structures limit your options.

Streams are read with the assumption that you do not know the maximum size. You read the file in sections to avoid waiting forever for the end of file and to avoid flooding memory with giant files. Even if a stream has a guaranteed small size, you still have to include code to handle the error case where someone sends you the wrong file and it is giant.
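One way to guard against a giant stream is to read in sections and stop once a limit is passed. This is a minimal sketch for a blocking stream; the function name readStreamWithLimit() and the default section size are assumptions made for the example.

```php
<?php
// Read a stream in sections and give up when it grows past $maxBytes.
// readStreamWithLimit() is an illustrative name, not a PHP built-in.
// Assumes a blocking stream; a non-blocking stream needs extra handling.
function readStreamWithLimit($stream, int $maxBytes, int $chunkSize = 8192): string
{
    $data = '';
    while (!feof($stream)) {
        $segment = fread($stream, $chunkSize);
        if ($segment === false) {
            throw new RuntimeException('Stream read failed');
        }
        $data .= $segment;
        if (strlen($data) > $maxBytes) {
            // Someone sent the wrong, giant file: stop before memory floods.
            throw new LengthException("Stream exceeded $maxBytes bytes");
        }
    }
    return $data;
}
```

The limit is checked after every section, so the most memory you risk is the limit plus one section.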

Lines of data in files are typically indicated by a line end character, defined in code as "\n". Apple messed up a lot of their code with random use of "\r", "\r\n", "\n\r", and "\n". Way back before the start of modern computing, everyone used "\r\n". The official Email standard adopted "\r\n". PHP uses "\n".

Some file formats use lengths instead of delimiters and other structures. You have to read the file in segments and decode the data yourself. Our tests will use lines indicated by "\n".

My tests do a few things to indicate real life results. We are using lines in files to get the most options. The files vary through 1 row, 10 rows, 100, 1000, 10,000, 100,000, and 1,000,000. Every test will run three times to highlight variations based on environmental changes. The results can be translated to other file structures.

Memory usage tests often remain constant across similar tests because PHP compiles the code at the start and allocates most of the memory required. All compiled languages work the same. You have to work with huge amounts of data to see memory usage variations. Sometimes you have to measure the usage externally to include things like file buffers.

Very short tests can vary in speed by a huge amount because the time depends mostly on the external environment loading things. When you run a test three times back to back, the first time can be double the other times due to things loading the first time and being in memory the next time. You will also see variations between a freshly booted computer and a computer that has run for some time. The measurements from a fresh boot are usually hardware dependent. Measurements from running systems depend on everything that ran between the last boot and the current test.

There are ways to even out tests. Before you time a file access, you can access the file at the operating system level to get the file directory information cached by the operating system. For our tests, I include a filesize() before the accesses. There are still big variations between the fastest and the slowest test when repeated several times in the same code execution.
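A warm-up followed by three timed runs could be sketched as below. The function name timeWholeFileRead() is hypothetical, and hrtime() assumes PHP 7.3 or later; microtime(true) is the usual substitute on older versions. This is not the exact code behind the results on this page, only an illustration of the approach.

```php
<?php
// Time a whole-file read several times in the same code execution.
// timeWholeFileRead() is an illustrative name, not a PHP built-in.
function timeWholeFileRead(string $path, int $runs = 3): array
{
    // filesize() touches the directory information before timing starts
    // and also supplies the length for fread().
    $size = filesize($path);
    $timings = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);            // nanoseconds, monotonic clock
        $file = fopen($path, 'rb');
        $contents = $size > 0 ? fread($file, $size) : '';
        fclose($file);
        $timings[] = (hrtime(true) - $start) / 1e6; // milliseconds
        unset($contents);                 // do not carry the data into the next run
    }
    return $timings;
}
```

Comparing the first entry with the later entries shows how much of the first run was environment overhead.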

fread()

You can read a file using fopen(), fread(), and fclose(). fread() can read the whole file in as a single string or read segments of a size specified by you. The fixed segment size helps you read a large file in small chunks. Let's see what we can learn using fread().

We test with 100 byte lines because there are lots of examples around that size and the mental calculation of file size is easy using 100. We test with files containing 1 line, 10 lines, 100, 1,000, 10,000, 100,000, and 1,000,000 lines. Files beyond that size are usually video files and not something we would process with regular file code.
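Test files of this shape can be generated with a few lines of code. The sketch below is an assumption about how you might build them, not the generator used for these tests; makeTestFile() and the filler character are made up for the example.

```php
<?php
// Create a file of $lines lines, each exactly $lineLength bytes including
// the trailing "\n". makeTestFile() is an illustrative name.
function makeTestFile(string $path, int $lines, int $lineLength = 100): int
{
    $line = str_repeat('a', $lineLength - 1) . "\n";
    $file = fopen($path, 'wb');
    if ($file === false) {
        throw new RuntimeException("Cannot create $path");
    }
    $written = 0;
    $remaining = $lines;
    while ($remaining > 0) {
        // Write up to 1,000 lines per fwrite() to avoid many tiny writes.
        $count = min($remaining, 1000);
        $bytes = fwrite($file, str_repeat($line, $count));
        if ($bytes === false) {
            break;
        }
        $written += $bytes;
        $remaining -= $count;
    }
    fclose($file);
    return $written; // total bytes written
}
```

A loop over 1, 10, 100, and so on up to 1,000,000 then gives you the full set of test files, with the largest at 100 megabytes.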

The line could represent a fixed length record. You can access fixed length records randomly but you need to know the location of each record. You might start by reading the whole file to build an index. This file performance information will help with the index build.
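When every record has the same length, the location of a record is simply its number multiplied by the record length, so no index is needed for lookups by position. A minimal sketch, assuming 100 byte records numbered from zero; readFixedRecord() is a hypothetical name.

```php
<?php
// Fetch record number $index from a file of fixed length records.
// readFixedRecord() is an illustrative name, not a PHP built-in.
function readFixedRecord($file, int $index, int $recordLength = 100): ?string
{
    // Record N starts at byte N * record length.
    if (fseek($file, $index * $recordLength) !== 0) {
        return null;
    }
    $record = fread($file, $recordLength);
    if ($record === false || strlen($record) < $recordLength) {
        return null; // past the end of the file or a truncated record
    }
    return $record;
}
```

The file handle comes from fopen($path, 'rb') and can stay open across many lookups, which avoids paying the open overhead for every record.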

For small files, the environment causes most delays

Reading the whole of a one line file typically requires 20 milliseconds on my machine. Running several tests of the same read in the same code produces variations between 20 and 30 milliseconds. Think about that. In just 20 milliseconds, on a system that is not busy, the environment can introduce 10 milliseconds of delay.

On a freshly booted machine with almost nothing else running, the time jumps up to 234 milliseconds for the first read then drops down to 15 milliseconds for the second run of the same test. 15 milliseconds is the actual open, read, and close time when there is nothing else running. The other 219 milliseconds of the first read is the time used by the operating system and file system, the time taken to load the directory nodes for the file path into memory, the time to work out access rights, and all the other overheads.

On a busy system, your file accesses might be fast due to most things being cached in memory. A similar system might be very slow due to other processes flooding the file cache. To test, run read tests across the day with each read test running several times in quick succession. The first read will include all the overheads and the second read will use the cached data. This will give you an idea of your file system efficiency across the day. You may need to increase the operating system file cache size.

Small increases in file length have little effect

Reading 10 lines as one string adds only 5 milliseconds, showing how little effect small amounts of data have on the processing compared to the overheads of opening files. When you have to read a file then change it, you can save time by opening the file for read + write, then reading then writing then closing.

Significant increases in file length have a significant effect

A hundred line file read added only 5 milliseconds while a thousand line file read added over 70 milliseconds. This is the point where the operating system had to go back to the file system to read more segments from disk. The elapsed time switches from file open overheads to file input/output delays. You have to work out what is significant on your system.

What is significant in your system?

Disks used to store data in 256 byte blocks then they switched to 1k segments internally. SSD and RAID can internally use segments of 64k or more. Reading one extra byte from a file might make the software and hardware process an extra segment. If the whole segment is cached, you can then read the next byte from memory. When a cache is flooded, the segment may be dropped and your next one byte read will drag the whole segment back into memory.

There are memory caches in the disk hardware, in disk controllers and RAID hardware controllers, in some file systems, and in most operating systems. When disk cache is flooded, you can experiment with moving files across disks to balance activity. RAID controller memory is too slow and expensive. Extra memory for your main processor is usually the best option. You can then change the operating system file cache settings if the operating system does not automatically use the extra memory.

Decrease memory use

For each test, the increase in memory usage matched the file size. When I tested two reads of the same file into a string variable with the same name, the memory usage doubled. PHP did not run through a garbage collection cycle to free up the memory from the first use of the variable. There are effectively two lines of code saying $contents = fread(). What can we do to save memory?

Some of my applications read a whole file, extract relevant data, then read several similar whole files to extract related data. They will eat memory until PHP decides to run garbage collection to free up memory no longer used. You can do that manually. The following code example shows what you might do to minimise memory usage when multiple large file reads chew up memory.

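// $path is a placeholder for the file being read.
$file = fopen($path, 'rb');
$contents = fread($file, filesize($path));
fclose($file);

// Process $contents.

// Drop the string before the next large read, then collect any reference cycles.
unset($contents);
gc_collect_cycles();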

An alternative is to read one line at a time. What happens when we try to save memory by reading one line at a time? Our test lines are fixed at 100 bytes so I tested reading in a loop 100 bytes per fread().

There was little difference for small files. Some tests ran faster in some instances because the variation in overheads from the system was more than the small time used to read the data. The best read time for a 10 line file one line at a time was equal to the best read time for reading the whole of a 100 line file but the variations between tests are bigger than the difference between the two approaches. Plus you do not save much memory.

For a large file, the overheads of multiple small reads were significant, taking 1.5 times longer and up to 4 times longer in some tests. For real life use, you would use a read segment larger than 100 bytes, perhaps a megabyte at a time, and the overheads would not be significant.

Data is stored on disk in segments of 4K, 4,096 bytes. If we read 40 lines, 4,000 bytes, at a time, we would be close to a segment at a time but not exactly one segment. Some reads would cross two segments, resulting in two reads. For maximum efficiency at the file system, you would read something several times larger than the segment size.

file_get_contents() is similar to fread() when fread() reads the whole file in one read. We look at file_get_contents() next.

file_get_contents()

This series of tests uses file_get_contents() to read the whole file in as one string, then to read the file one line at a time. The results should be similar to the previous fread() tests.

The same for small files

For small files, file_get_contents() is almost exactly the same as fread(). The variation between the two functions is less than the variation you get when you run the test multiple times. Internally they should be the same. For some reason, most of my tests show file_get_contents() as slightly slower. I suspect the difference is file_get_contents() offering different options with the result that the internal checks take longer.

Slow for large files

For large files, file_get_contents() should be almost exactly the same as fread(). In my tests using a lightly loaded system, file_get_contents() is similar to fread for a single read of the whole file.

file_get_contents() can use memory mapping in some operating systems. Memory mapping is supposed to be faster but is unpredictable and sometimes a disaster. Slow performance with memory mapping is more obvious on heavily loaded systems. If you do not need the whole file, fread() has a predictable stable performance.

When you do not need the whole file, you can add an offset to file_get_contents() for the location you want to read and a length for the read. file_get_contents() is incredibly slow when using offset and length. Do not use file_get_contents() to read a whole file bit by bit.
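For reference, the two ways of reading part of a file look like this. It is a minimal sketch with hypothetical function names; the offset and length arguments of file_get_contents() are the fourth and fifth parameters of the built-in function.

```php
<?php
// Read $length bytes starting at $offset using file_get_contents().
// Every call opens and closes the file again.
function sliceWithFileGetContents(string $path, int $offset, int $length): string
{
    $data = file_get_contents($path, false, null, $offset, $length);
    if ($data === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return $data;
}

// Read the same slice from a handle that is already open.
// Repeated reads reuse the handle, so the open overhead is paid once.
function sliceWithFread($file, int $offset, int $length): string
{
    if (fseek($file, $offset) !== 0) {
        throw new RuntimeException('Seek failed');
    }
    $data = fread($file, $length);
    if ($data === false) {
        throw new RuntimeException('Read failed');
    }
    return $data;
}
```

Both return the same bytes. The difference shows up when you read many slices, because the first version repeats the whole open and close for every slice.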

Some of my tests ran repeatedly to ensure consistency. On other occasions, I ran tests after a reboot when most resources are not loaded into memory. I found file_get_contents() on a big file could slow down while the system juggled memory allocations. The variations in file_get_contents() performance are always worse than with fread().

On my machine, a file larger than ten megabytes is a good candidate for the fread() approach. My file oriented applications read multiple files with several files approaching gigabytes. file_get_contents() does not work for those files.

file()

file() reads a file into an array with the data split at each line end, using the Unix line end, a newline or "\n". The lines are returned in an array. file() is as fast as fread() for small files and, like file_get_contents(), does not require an open or a close. file() is a good choice when you need text data as an array.
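By default each array element from file() still ends with its "\n". The FILE_IGNORE_NEW_LINES flag strips the line ends for you. A small sketch, with readLines() as a hypothetical wrapper name:

```php
<?php
// Read a text file into an array of lines with file().
// readLines() is an illustrative wrapper, not a PHP built-in.
function readLines(string $path, bool $stripLineEnds = true): array
{
    // FILE_IGNORE_NEW_LINES removes the trailing "\n" from each element.
    $flags = $stripLineEnds ? FILE_IGNORE_NEW_LINES : 0;
    $lines = file($path, $flags);
    if ($lines === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return $lines;
}
```

For a file containing "a\nb\n", readLines($path) gives the two elements "a" and "b", while readLines($path, false) keeps the "\n" on each element.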

The array uses more memory

For medium to large files, there is a rapid increase in memory usage. The array uses 1.6 times more memory than the string you get from fread(). Do not use file() when you do not need the array.

Overall, the memory usage is less than reading the whole file as a string then manually splitting the file into an array. I tested the idea with file_get_contents reading into a string then a PHP string function splitting the string into an array. Memory usage was 2.6 times the file size, 1 times for the string plus 1.6 times for the array.

You can save memory with fread() and manual processing when you do not need every line in the array. You can read the file line by line and add to the array only the lines you need.
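Keeping only the lines you need might be sketched as follows. This version uses fgets() rather than fread() so it also works when lines vary in length; the function name readMatchingLines() and the filter callback are assumptions for the example.

```php
<?php
// Build an array containing only the lines accepted by $wanted.
// readMatchingLines() is an illustrative name, not a PHP built-in.
function readMatchingLines(string $path, callable $wanted): array
{
    $file = fopen($path, 'rb');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $kept = [];
    // Only one line is held in memory at a time; rejected lines are never stored.
    while (($line = fgets($file)) !== false) {
        $line = rtrim($line, "\n");
        if ($wanted($line)) {
            $kept[] = $line;
        }
    }
    fclose($file);
    return $kept;
}
```

The memory used by the array then depends on how many lines pass the filter, not on the size of the file.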

Processing can be slower

Part of the file() processing is splitting the data at line ends and part is building the array. file() is probably faster than reading the file as a string then splitting out the lines yourself, although my tests using explode() showed a similar time, suggesting that internally PHP is using similar code. What my tests show is a far bigger variation in processing time for file() compared to fread(). The array processing appears to be more variable than the file reading.

PHP array processing is fast for small arrays then slows down for large arrays. Anything you can do to minimise an array could be useful. The manual array build would not help if you are dropping only a few records, say 20 percent. To make a difference with manual code, you would have to drop half or more of the incoming data.

Write

What are the performance overheads when writing?

A small write is four times longer than a read

The overhead of creating a new file is four times the overhead of opening an existing file. After the file is created, writing a few lines with fwrite() takes the same time as reading a few lines with fread().

A large file can be slower

Using repeated tests with large files, some writes were similar to reads while others were slower. When a file runs over a certain size, it is too big to create in one contiguous section of disk and fragments. Allocating those fragments adds time to the write.

On most systems, fragmentation is effectively random because the fragmentation depends on every other file creation from back when the file system was created. There is no defragmentation for most file systems. If all your file writes are slow, upgrade to a bigger disk. The extra space will reduce fragmentation.

You can also get defragmentation programs for some file systems but the effect does not last when your disk is nearly full. Depending on the file system, you may need 20% or more of your file system empty to allow quick writes of new files.

(Most system managers will argue that their favourite OS/file system is "magic" and does not need defragmentation. Tests show every file system slows down due to fragmentation when the disk is more than about 60% full. The only difference is how fast the fragmentation becomes obvious.)

fwrite or file_put_contents()?

fwrite() and file_put_contents() produce similar write times. Any extra overhead of file_put_contents() is tiny compared to the overheads of creating a new file.

To save memory in some situations, you might build a new file using fwrite() to add data segment by segment. In my tests with big files written 100 bytes at a time, fwrite() ran five times longer than the equivalent fwrite() of one big string. You are continually bashing your head against the overhead of extending a file. For a file process like that, you would save significant time by collecting many lines of data into one larger write.
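Collecting lines into larger writes could be sketched like this. The function name writeLinesBuffered() and the one megabyte default are assumptions for the example; the buffer size is something you would tune with your own tests.

```php
<?php
// Write many small lines using a small number of large fwrite() calls.
// writeLinesBuffered() is an illustrative name, not a PHP built-in.
// Each entry in $lines is expected to already end with "\n".
function writeLinesBuffered(string $path, iterable $lines, int $bufferSize = 1048576): int
{
    $file = fopen($path, 'wb');
    if ($file === false) {
        throw new RuntimeException("Cannot create $path");
    }
    $buffer = '';
    $written = 0;
    foreach ($lines as $line) {
        $buffer .= $line;
        if (strlen($buffer) >= $bufferSize) {
            $written += (int) fwrite($file, $buffer); // one large write
            $buffer = '';
        }
    }
    if ($buffer !== '') {
        $written += (int) fwrite($file, $buffer);     // flush the remainder
    }
    fclose($file);
    return $written; // total bytes written
}
```

Memory use stays near the buffer size because the lines can come from a generator instead of one giant array or string.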

Read + write

The overhead of opening a file is significant for small files. Think about a small file where you read the file then change one value or add one line. fopen() lets you open a file for reading and writing, mode 'r+'. You can use fread() to read the whole file then fwrite() to add an extra line then close the file. You save enough time to be significant when there are many similar file modifications.
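A single open for both the read and the write might look like the sketch below. The function name appendLineAfterRead() is hypothetical, and the code assumes no other process is changing the file at the same time.

```php
<?php
// Open once in 'r+' mode, read the whole file, add a line, close.
// appendLineAfterRead() is an illustrative name, not a PHP built-in.
function appendLineAfterRead(string $path, string $line): string
{
    $file = fopen($path, 'r+b');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    // fstat() reads the size from the open handle.
    $size = fstat($file)['size'];
    $contents = $size > 0 ? fread($file, $size) : '';

    // Make sure the pointer is at the end, then add the new line.
    fseek($file, 0, SEEK_END);
    fwrite($file, $line);
    fclose($file);

    return $contents; // what the file held before the write
}
```

The function returns the original contents so you can inspect them, while the file on disk gains the extra line with only one open and one close.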

When multiple applications want the same file, you have to introduce file locks and they are slow. To avoid locks, you might read the file then add the extra line as an append using file_put_contents(). You might write multiple separate files then use a separate process to merge all the files into one file. With multiple accesses to the same file, your first choices are about safety and reliability, not performance.
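The append with file_put_contents() uses the FILE_APPEND flag. In this minimal sketch the function name appendLine() is hypothetical and the optional LOCK_EX flag is included only to show where a lock would go if you decide you need one.

```php
<?php
// Add one line to the end of a file without reading it first.
// appendLine() is an illustrative name, not a PHP built-in.
function appendLine(string $path, string $line, bool $lock = false): int
{
    // FILE_APPEND writes at the end instead of replacing the file.
    // LOCK_EX asks for an exclusive lock for the duration of the write.
    $flags = FILE_APPEND | ($lock ? LOCK_EX : 0);
    $bytes = file_put_contents($path, $line . "\n", $flags);
    if ($bytes === false) {
        throw new RuntimeException("Cannot append to $path");
    }
    return $bytes; // bytes written, including the "\n"
}
```

Whether you can safely skip the lock depends on your operating system, file system, and the size of each write, which is why safety comes before performance here.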

Other options

fgetc() reads one character at a time. That sounds slow. You might use it for a very slow stream of data. You could then switch off the stream the instant you get the data you need.

fgets() reads one line at a time from a file. You use this when you want lines from a file and they are of variable length. For small files, it is quicker to get the lines in an array using file(). For large files, you will save some file processing overheads by reading big segments of data into a buffer then manually splitting the lines.
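Reading big segments and splitting the lines yourself needs one extra step: the last piece of each segment may be half a line, so it has to be carried over to the next segment. A minimal sketch, with eachLineBuffered() as a hypothetical name and one megabyte as an assumed segment size:

```php
<?php
// Read a file in large segments and hand each complete line to $handle.
// eachLineBuffered() is an illustrative name, not a PHP built-in.
function eachLineBuffered(string $path, callable $handle, int $chunkSize = 1048576): int
{
    $file = fopen($path, 'rb');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $carry = ''; // incomplete line left over from the previous segment
    $count = 0;
    while (!feof($file)) {
        $chunk = fread($file, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $lines = explode("\n", $carry . $chunk);
        // The last piece has no "\n" after it yet, so hold it back.
        $carry = array_pop($lines);
        foreach ($lines as $line) {
            $handle($line);
            $count++;
        }
    }
    fclose($file);
    if ($carry !== '') {
        $handle($carry); // final line without a trailing "\n"
        $count++;
    }
    return $count; // number of lines handled
}
```

Lines are passed to the callback without their "\n", and memory use stays near the segment size plus one line.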

fscanf() reads from a file and parses the input according to a format string, so you can pick values out of data with a known layout and carry on reading or writing from that point. I have not found fscanf() useful. To work out which data you use, you need to look at the whole file. You can then perform your own search in memory and display the data either side of the search target to confirm your search works.

File based databases

fseek() lets you jump to any place in a file. I mentioned reading a whole file to build an index to each item/record. You could then find the item you want in the index and use fseek() to jump to the location of the item in the main file. You might use this approach when there are many documents in a big file and the index contains the document names. Find the name in the index to get the location in the file then read the document from the file.

This is how some file based databases work. The data files are too big to read into memory. The indexes are small enough to read in as arrays. You find what you want in the array then read the data file. The documents/items/records/rows in the data files might be thousands of bytes long. The identifiers might be only a few bytes each. You are looking at in memory indexes one thousandth of the data size.
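The index and lookup could be sketched as below. The data layout is an assumption made for the example: one record per line, with the identifier at the start of the line followed by a tab. The function names buildIndex() and fetchRecord() are hypothetical.

```php
<?php
// Scan the data file once and remember where each record starts.
// Assumed layout: "identifier<TAB>rest of record\n", one record per line.
function buildIndex(string $path): array
{
    $file = fopen($path, 'rb');
    if ($file === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $index = [];
    while (true) {
        $offset = ftell($file);          // position before reading the line
        $line = fgets($file);
        if ($line === false) {
            break;
        }
        $key = strstr($line, "\t", true); // text before the first tab
        if ($key !== false) {
            $index[$key] = [$offset, strlen($line)];
        }
    }
    fclose($file);
    return $index; // identifier => [offset, length]
}

// Jump straight to one record using the in-memory index.
function fetchRecord($file, array $index, string $key): ?string
{
    if (!isset($index[$key])) {
        return null;
    }
    [$offset, $length] = $index[$key];
    fseek($file, $offset);
    return rtrim(fread($file, $length), "\n");
}
```

Only the identifiers, offsets, and lengths stay in memory. Each lookup then costs one fseek() and one fread() against the big data file.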

Structure first

Look at data structure first, not performance. Then look at the accesses you need, not the performance. What do you need in memory at the same time? Minimising the data you read and write will give you the best performance no matter what code you use.

After you map out the data and the range of access you need, you can pick the highest performance code for the task. The only part you need to test is the optimum segment size for reads and writes when a file is too big to read in one hit and the lines/records are too small to read efficiently one at a time.