SSD Linux Tweaks

2012-12-05 (updated: 2015-02-23) by Philip
Tags: Linux, ssd

Solid State Drives have already become one of the best computer upgrades money can buy. Most current OSes add some functionality to accommodate SSDs (like TRIM); however, there is much left for the user to manually enable and tweak. This article explores the important aspects of tuning Linux for the best possible SSD performance: enabling TRIM support, choosing the right filesystem type, aligning partitions, reducing unnecessary small disk writes for longer SSD endurance, and more.

Partitions Alignment

Proper alignment of partitions is very important for SSDs, as it avoids excessive read-modify-write cycles. Older partitioning tools were optimized for hard disk drives (with cylinders, heads and sectors) rather than SSDs with different NVM page and block sizes. Newer partitioning tools typically align partitions to start at 1MB marks. This covers all common SSD page and block size scenarios, as 1MB is divisible by 512KB, 128KB, 4KB and 512 bytes. Most recent Linux distributions already take SSDs into account and align partitions correctly; however, you may want to check to make sure. We will assume that your SSD is at /dev/sda in the following examples.

Check your partitions on /dev/sda with the command:

fdisk -lu /dev/sda

Note that there are 2048 sectors of 512 bytes in a Megabyte. For each partition in the output of this command, look at the "Start" column (given in sectors) and divide it by 2048. If the result is an integer, your partition is aligned properly to 1MB.
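The divisibility check above can be sketched as a small shell function. This is only an illustration: the name is_aligned is made up for this example, and the 512-byte sector size is an assumption that matches the text.

```shell
#!/bin/sh
# is_aligned START_SECTOR [SECTOR_BYTES]
# Succeeds when the partition start falls on a 1MB (1048576 byte) boundary.
# SECTOR_BYTES defaults to 512, the sector size assumed in the text.
is_aligned() {
    start=$1
    sector_bytes=${2:-512}
    [ $(( start * sector_bytes % 1048576 )) -eq 0 ]
}

# 2048 is the usual first-partition start with newer tools,
# 63 is the start sector older CHS-oriented tools used.
for s in 2048 63; do
    if is_aligned "$s"; then
        echo "start sector $s: aligned to 1MB"
    else
        echo "start sector $s: NOT aligned to 1MB"
    fi
done
```

On a live system the start sector can also be read from sysfs, for example: is_aligned "$(cat /sys/block/sda/sda1/start)" (sda1 is just an example partition name).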

To see your SSD's physical and logical block sizes, look at the following pseudo files:

cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size

Enable TRIM

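With SSDs, whenever you write data to the disk, it must first erase any data in the sectors it's writing to, slowing the drive. TRIM lets the operating system tell the drive which blocks are no longer in use, so they can be erased ahead of time. Modern operating systems and SSDs support TRIM to keep the SSD from slowing down over time.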

To verify TRIM support on your drive, use the following (where /dev/sda is your particular SSD; the "df" command shows mount points):

# hdparm -I /dev/sda | grep TRIM

* Data Set Management TRIM supported (limit 4 blocks)
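If you want to use this check in a script, something along these lines may work. It is a minimal sketch: has_trim is a hypothetical helper name, and it only looks for the "TRIM supported" text shown in the output above.

```shell
#!/bin/sh
# has_trim: reads `hdparm -I` output on stdin and succeeds if the
# drive reports TRIM support (hypothetical helper for this example).
has_trim() {
    grep -q 'TRIM supported'
}

# Sample line taken from the hdparm output shown above.
sample='   *    Data Set Management TRIM supported (limit 4 blocks)'
if printf '%s\n' "$sample" | has_trim; then
    echo "TRIM supported"
else
    echo "TRIM not reported"
fi
```

On a real system you would feed it the live output as root: hdparm -I /dev/sda | has_trim && echo "TRIM supported"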

There are generally three different ways to implement TRIM under modern Linux distributions:

1) discard (/etc/fstab mount option) - this performs TRIM in real-time after each file delete. Many distros don't enable this by default in fstab for SSDs. There seems to be a general consensus the "discard" option is not a very good method, as it is resource intensive and introduces some unnecessary performance issues.

2) fstrim via cron job - this seems to be the preferred method currently: add the fstrim command to a weekly, or even monthly, cron job. The only disadvantage of this method is that after a reboot, the fstrim command trims the whole filesystem again, because the record of what has already been trimmed is kept in kernel memory and is therefore volatile. Below is a sample script that can be put in the /etc/cron.weekly directory to do a trim on all drives that support it (set permissions to be executable):

#! /bin/sh
fstrim -a

Some older versions of fstrim may not support the -a parameter. In such cases you will have to specify the mount points to trim manually ("/" and "/boot", in this example):

#! /bin/sh
for mount in / /boot; do
    fstrim "$mount"
done

3) systemctl enable fstrim.timer - this is a relatively new systemd option, similar to fstrim via a cron job. It performs a weekly fstrim on the system, and can be checked with: systemctl status fstrim.timer

Partitions to keep away from the SSD

You can keep most of your partitions on the SSD to take advantage of its speed, including /, /boot, /boot/efi, etc. However, you should try to keep swap partitions, temporary files and log files, which are written to constantly, away from SSDs to increase their lifespan.

When creating partitions, simply put your swap partition on a HDD instead of the SSD. If you have plenty of memory, you can usually skip creating a swap partition altogether when installing Linux; in that case you should also reduce "swappiness" to zero so the OS does not attempt to swap to disk. If you do keep a swap partition, it may be a better idea to simply reduce swappiness instead.

To create a bash script that adjusts swappiness at boot time:

1) open (create, if necessary) the file /etc/rc.d/rc.local
Make sure the file is executable (chmod +x /etc/rc.d/rc.local), so that this bash script runs every time the system boots. The first line, as with every bash script, should be:
#!/bin/bash

2) add the following line to it
echo 0 > /proc/sys/vm/swappiness

The above ensures that the OS will only try to use a swap file when all RAM is exhausted. The default value for swappiness is 60, and lower numbers mean the OS will use the swap file less.

A more sensible solution may be to simply create the swap partition on a HDD and reduce swappiness somewhat, e.g. to a value of 30.
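If you would rather not echo a raw number into /proc by hand, a small wrapper can catch typos first. The sketch below is only an illustration: set_swappiness is a made-up helper, and the 0 to 100 range is an assumption based on the traditional limits of this setting.

```shell
#!/bin/sh
# set_swappiness VALUE [FILE]
# Writes VALUE to FILE (default /proc/sys/vm/swappiness) after checking
# that it is an integer between 0 and 100. Hypothetical helper; the
# optional FILE argument exists only so the function can be tried out
# without root.
set_swappiness() {
    value=$1
    target=${2:-/proc/sys/vm/swappiness}
    case $value in
        ''|*[!0-9]*)
            echo "swappiness must be a non-negative integer" >&2
            return 1 ;;
    esac
    if [ "$value" -gt 100 ]; then
        echo "swappiness must be between 0 and 100" >&2
        return 1
    fi
    echo "$value" > "$target"
}
```

In rc.local (which runs as root) this would be called as: set_swappiness 30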

Directories to move away from the SSD

It is also a good idea to keep temporary files away from the SSD to reduce writes. To move temporary files to RAM, edit /etc/fstab and add the following line to it:

tmpfs /tmp tmpfs nodev,nosuid,noexec,mode=1777 0 0
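After a reboot (or mount -a), you may want to confirm that /tmp really ended up in RAM. One possible way, sketched here with a hypothetical helper named fstype_of, is to look the mount point up in /proc/mounts:

```shell
#!/bin/sh
# fstype_of MOUNTPOINT [MOUNTS_FILE]
# Prints the filesystem type of MOUNTPOINT as listed in MOUNTS_FILE
# (default /proc/mounts). The last matching entry is used, since a
# later mount hides an earlier one on the same mount point.
fstype_of() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' \
        "${2:-/proc/mounts}"
}

if [ "$(fstype_of /tmp)" = "tmpfs" ]; then
    echo "/tmp is mounted in RAM (tmpfs)"
else
    echo "/tmp is not a tmpfs mount"
fi
```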

Other good candidates for moving from the SSD to RAM (or another HDD) may be browser cache files (for desktop systems), and /var/log, if you can live with all system logs being volatile and not surviving reboots. This is most useful for NAS or embedded applications. There are some subdirectories and files in /var/log that may have to be recreated each time you reboot, however. This can be accomplished with a couple of lines in the /etc/rc.d/rc.local file we already created/edited. Here is an example:

### directories to recreate at boot time
for dir in anaconda audit httpd mail ntpstats ppp samba squid sssd ; do
    mkdir -p /var/log/$dir
done
mkdir -p /var/log/samba/old

Alternatively, you can move the /var/log directory (and possibly /var/cache, /var/spool) to another drive to reduce writes to the SSD. To accomplish this, the easiest way is using a symbolic link to the new location. Assuming that you have a HDD mounted at /mnt/hdd1 , do:

# stop the syslog.service
systemctl stop syslog.service

# create a directory on /mnt/hdd1
mkdir /mnt/hdd1/var

# move the directory structure and contents of /var/log over to the new directory
mv /var/log /mnt/hdd1/var/

# create a symbolic link from /var/log to the new location
ln -s /mnt/hdd1/var/log /var/log

It may be a good idea to reboot at this point, so that other programs that may be writing to /var/log work properly. What we did here is, instead of reconfiguring every service that may possibly want to write to /var/log, we just redirect them to a new location using a symbolic link. That way any new programs that try to log to directories under /var/log will still work as expected. You can move /var/cache and /var/spool using the same method:

mv /var/cache /mnt/hdd1/var/
ln -s /mnt/hdd1/var/cache /var/cache

mv /var/spool /mnt/hdd1/var/
ln -s /mnt/hdd1/var/spool /var/spool
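Since the same move-then-link steps repeat for each directory, they could be wrapped in a function. This is a rough sketch rather than a tested migration tool: relocate_dir is a hypothetical name, and it assumes the destination is given as an absolute path and that the services writing to the directory have already been stopped.

```shell
#!/bin/sh
# relocate_dir SRC DEST_PARENT
# Moves directory SRC into DEST_PARENT and leaves a symbolic link at
# SRC pointing to the new location. DEST_PARENT must be an absolute
# path, otherwise the link would not resolve. Hypothetical helper.
relocate_dir() {
    src=$1
    dest_parent=$2
    name=$(basename "$src")

    # Refuse to act on something that is missing or already a link.
    if [ ! -d "$src" ] || [ -L "$src" ]; then
        echo "$src is not a regular directory, skipping" >&2
        return 1
    fi
    # Refuse to overwrite an existing target.
    if [ -e "$dest_parent/$name" ]; then
        echo "$dest_parent/$name already exists, skipping" >&2
        return 1
    fi

    mkdir -p "$dest_parent" &&
    mv "$src" "$dest_parent/" &&
    ln -s "$dest_parent/$name" "$src"
}
```

With the HDD from the example mounted at /mnt/hdd1, the calls would be: relocate_dir /var/cache /mnt/hdd1/var and relocate_dir /var/spool /mnt/hdd1/var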

Notes: SELinux may have some issues with these symbolic links. There is an alternative of using "mount --bind /mnt/hdd1/var/log /var/log" to create new mount points instead of symbolic links, but that is usually reserved

Tune Fstab Filesystem Options

Most mount options, filesystem types, the /tmp directory and other boot options in modern Linux systems are configured in the /etc/fstab file.

Back up fstab before making any changes, so you can easily recover from errors:

# cp /etc/fstab /etc/fstab.bak

Modify /etc/fstab mount parameters for your SSD partition by adding the following options separated by commas:

noatime - do not update access time for each accessed file/directory to reduce disk writes (automatically implies nodiratime). This works well for both HDDs and SSDs. If you need to keep access times functionality, using relatime is a good compromise (only causes atime write if the file has been modified since last being accessed).

discard - enables TRIM support with ext4 and kernels 2.6.33 or later

commit=30 - delays/buffers writes to disk for up to 30 seconds. It may be a problem if power interruption is likely. The default value is 5 seconds; you can use a number up to a couple of minutes depending on the likelihood of power interruption (you may lose up to N seconds of work, though most of the time this won't happen as software can still sync the data to disk, overriding the commit setting). This reduces disk writes and increases performance by combining writes into one single larger write, and cancelling updates to previous writes within the commit time frame.

Journaling Mode
When using a journaling file system like ext4, and with an increased commit time, it is important to choose a journaling mode that balances data safety against drive performance. It is prudent to use data=ordered for some additional safety with longer commit intervals, instead of data=writeback, for example.

data=writeback - no data journaling (but metadata is journaled). A crash/recovery cycle can cause incorrect data to be in files updated shortly before the crash. Best performance of all journaling modes, and fewer writes increase SSD longevity.

data=ordered - journals metadata, but orders metadata changes with the data blocks into "transactions". When a write is done, the associated data blocks are written first. This journaling method is a good compromise between performance and data safety.

data=journal - all data and metadata is journaled, slow.

Example /etc/fstab :

[Image: SSD Linux fstab]
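As a rough sketch of how the options discussed above might be combined, the snippet below writes a few hypothetical fstab entries to a temporary file and then extracts the options for one mount point. The UUID placeholders, the ext4 layout and the fstab_opts helper are all assumptions made for illustration; discard is left out here in favor of periodic fstrim.

```shell
#!/bin/sh
# Hypothetical fstab entries. The UUID placeholders and partition
# layout are illustrative only and must be replaced with real values.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# <device>        <mount>  <type>  <options>                     <dump> <pass>
UUID=<root-uuid>  /        ext4    defaults,noatime,commit=30    0      1
UUID=<boot-uuid>  /boot    ext4    defaults,noatime              0      2
tmpfs             /tmp     tmpfs   nodev,nosuid,noexec,mode=1777 0      0
EOF

# fstab_opts FILE MOUNTPOINT
# Prints the options field for MOUNTPOINT, skipping comment lines.
fstab_opts() {
    awk -v mp="$2" '$1 !~ /^#/ && $2 == mp { print $4 }' "$1"
}

fstab_opts "$sample" /
rm -f "$sample"
```

Run as is, this prints the options string of the sample root entry. Pointing the helper at the real file, as in fstab_opts /etc/fstab / , is a quick way to double check an edit before rebooting.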

Notes:
Using the 'defaults' option implies all the following options: rw,suid,dev,exec,auto,nouser,async . Options are read left-to-right, so if you use 'defaults' after one of those seven options it will get trumped by 'defaults'. For this reason, when using 'defaults' it may be a good idea to put it in the beginning of the options string.

Currently mounted filesystems, along with their corresponding mount options, can be viewed simply with: # mount

Change the I/O Scheduler

Linux has three main kernel I/O schedulers: CFQ, DEADLINE, and NOOP. The default is CFQ; SSDs can benefit from using either DEADLINE or NOOP instead, as outlined below. The task of an I/O scheduler is mainly to group, reorder, and merge I/O operations when possible. One of the main intents is to decrease disk seeks. Flash devices have negligible seek times, and therefore can benefit from switching to a different I/O scheduler.

CFQ (Completely Fair Queuing) - this is the default scheduler since kernel 2.6.18 and has been designed to deal with the rotational latencies of spinning platter drives. It places synchronous requests submitted by processes into a number of per-process queues and then allocates timeslices for each of the queues to access the disk. It prioritizes the number of requests and the length of the timeslice depending on the process priority.

DEADLINE - this scheduler does some sorting to guarantee read requests take priority over write, which is useful to guarantee read responsiveness under heavy writes. The DEADLINE scheduler imposes a deadline on every I/O operation. It uses multiple queues to store operations and sorts them according to their deadline. By committing to these deadlines, it can guarantee no I/O starvation during normal CPU loads. It is a good fit if you are worried about I/O starvation, and if you'd like to keep sorted read/write queues. It is better than CFQ for SSD drives.

NOOP - the NOOP scheduler uses the least CPU cycles, as it is a simple FIFO (first-in first-out) queue and implements request merging, without any reordering. It is very battery-friendly. It is a great match for laptops, SSDs, USB flash drives, and any flash media where there is negligible seek penalty.

To view the currently used scheduler for "sda", for example, execute:

cat /sys/block/sda/queue/scheduler
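The file lists all available schedulers on one line, with the active one in square brackets. If you only want the active name (in a script, for instance), a sed one-liner like the following could be used. The function name active_scheduler is just a label for this example.

```shell
#!/bin/sh
# active_scheduler: reads the contents of a queue/scheduler file on
# stdin and prints only the name enclosed in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Sample contents, as they might look with CFQ active.
echo "noop deadline [cfq]" | active_scheduler
```

Against the real file this would be: active_scheduler < /sys/block/sda/queue/scheduler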

To change the scheduler for a specific drive ("sda" and "sdb" in the example below), add the following to /etc/rc.d/rc.local (to be applied at boot time):

# changes the I/O scheduler for sda to NOOP
echo noop > /sys/block/sda/queue/scheduler

# changes the I/O scheduler for sdb to DEADLINE
echo deadline > /sys/block/sdb/queue/scheduler
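Instead of hard-coding drive names, the choice could be based on whether the kernel considers a drive rotational. The sketch below relies on the /sys/block/<dev>/queue/rotational flag (0 for SSDs, 1 for spinning disks), which is not covered elsewhere in this article; scheduler_for is a hypothetical helper, and the loop only prints what it would set rather than changing anything.

```shell
#!/bin/sh
# scheduler_for ROTATIONAL
# Prints the scheduler this article suggests for a drive:
# noop for non-rotational devices (0), cfq otherwise.
scheduler_for() {
    if [ "$1" = "0" ]; then
        echo noop
    else
        echo cfq
    fi
}

# Dry run: show the suggested scheduler for each sd* drive.
for q in /sys/block/sd*/queue; do
    [ -r "$q/rotational" ] || continue
    echo "$q/scheduler <- $(scheduler_for "$(cat "$q/rotational")")"
done
```

To actually apply the result from rc.local, the echo line in the loop would be replaced with a write to the scheduler file, as in the commands above.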

Memory Disk Write Buffers

The Linux kernel has a number of tunable memory write buffers that define how the system uses memory to delay disk writes. You can control how often the OS writes old "dirty" data to disk, how aggressively to use the swap file, etc. A number of pseudo files under /proc/sys/vm/ control how all this works.

/proc/sys/vm/laptop_mode - determines how many seconds after a read a writeout of changed files should start (this is based on the assumption that a read will cause an otherwise spun down disk to spin up again). This delays writes to disk (initially intended to allow laptop disks to spin down while not in use, hence the name).

/proc/sys/vm/dirty_writeback_centisecs - how often the kernel should check if there is "dirty" (changed) data to write out to disk (in centiseconds).

/proc/sys/vm/dirty_expire_centisecs - how old "dirty" data should be before the kernel considers it old enough to be written to disk (in centiseconds).

/proc/sys/vm/dirty_ratio - the maximum amount of memory (in percent) to be used to store dirty data before the process that generates the data will be forced to write it out. Setting this to a high value should not be a problem as writeouts will also occur if the system is low on memory.

/proc/sys/vm/dirty_background_ratio - the amount of memory (in percent) filled with dirty data at which the kernel starts writing it out to disk in the background. This should be quite a bit lower than the above dirty_ratio to allow the kernel to write out chunks of dirty data in one go.

Delay writes from memory to disk by adding the following to the bottom of the /etc/sysctl.conf file (to be applied at boot time):

# dirty_ratio is the max percent of memory to use (default is 20)
vm.dirty_ratio = 30
# vm.dirty_writeback_centisecs = 500 (default 5 sec) buffer/delay disk writes
vm.dirty_writeback_centisecs = 1500
# vm.dirty_expire_centisecs = 3000 (default 30 sec) buffer/delay disk writes
vm.dirty_expire_centisecs = 4500
# vm.swappiness (default 60, use 0 to 60) smaller values reduce swap usage. Already applied in rc.local above
# vm.swappiness = 20

Notes:
Do not add comments on the same lines as parameters, or you may get errors applying the parameters at boot.
To see currently applied vm timing parameters, use: sysctl -a | grep dirty
To process sysctl.conf without reboot, use: sysctl -p /etc/sysctl.conf
To make sure kernel variables are successfully applied at boot: systemctl status systemd-sysctl.service
Alternatively, all those same settings can be added to the /etc/rc.d/rc.local file instead (with slightly different syntax, i.e.: echo 0 > /proc/sys/vm/swappiness )
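Because the two timing parameters are expressed in centiseconds, it is easy to misread them. The following sketch prints the current values in sysctl.conf syntax and adds the equivalent in seconds. The helper names cs_to_s and show_vm_dirty are made up for this example, and the optional directory argument exists only so the function can be tried against test files.

```shell
#!/bin/sh
# cs_to_s CENTISECS: convert centiseconds to seconds (two decimals).
cs_to_s() {
    printf '%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

# show_vm_dirty [DIR]
# Prints the writeback tunables from DIR (default /proc/sys/vm),
# with a seconds equivalent for the *_centisecs values.
show_vm_dirty() {
    dir=${1:-/proc/sys/vm}
    for key in dirty_ratio dirty_background_ratio \
               dirty_writeback_centisecs dirty_expire_centisecs; do
        [ -r "$dir/$key" ] || continue
        val=$(cat "$dir/$key")
        case $key in
            *_centisecs) echo "vm.$key = $val ($(cs_to_s "$val") s)" ;;
            *)           echo "vm.$key = $val" ;;
        esac
    done
}

show_vm_dirty
```

With the values suggested above, 1500 centiseconds corresponds to 15 seconds and 4500 centiseconds to 45 seconds.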

Check total disk writes with S.M.A.R.T.

Newer SSDs have a long lifespan, except maybe smaller TLC drives. Still, it is a good idea to know what your average daily writes and the life expectancy of the drive are. You can check the total bytes written to the SSD in Linux by using the following shell command that displays S.M.A.R.T. data (assuming your SSD is at /dev/sda):

smartctl -A /dev/sda

Look for attribute 241, which says something like:

Samsung: "Total LBAs Written" (value is in LBAs of 512 bytes) - multiply by 512 to get total bytes written: RAW_VALUE * 512 / 1024 / 1024 / 1024 = total Gigabytes written.
Intel: "Host_Writes_32MiB" (value is in 32 Megabyte blocks) - multiply RAW_VALUE x 32 / 1024 = total Gigabytes written to the disk.

You can use this number to estimate your daily writes, and total life expectancy of the drive.
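The conversions above, plus a rough daily average, can be sketched in shell arithmetic. The function names are made up for this example, the input numbers are invented purely to show the arithmetic, and dividing by Power_On_Hours (S.M.A.R.T. attribute 9) is an assumption: it averages over powered-on time, not calendar days.

```shell
#!/bin/sh
# lbas_to_gib RAW: "Total LBAs Written" style value, 512 bytes per LBA.
lbas_to_gib() {
    echo $(( $1 * 512 / 1024 / 1024 / 1024 ))
}

# units32_to_gib RAW: "Host_Writes_32MiB" style value, 32 MB per unit.
units32_to_gib() {
    echo $(( $1 * 32 / 1024 ))
}

# gib_per_day TOTAL_GIB POWER_ON_HOURS: average writes per 24 hours
# of powered-on time (integer result, rounded down).
gib_per_day() {
    if [ "$2" -le 0 ]; then
        echo "power-on hours must be greater than zero" >&2
        return 1
    fi
    echo $(( $1 * 24 / $2 ))
}

# Invented example values, not taken from a real drive.
total=$(lbas_to_gib 10485760000)
echo "total written: ${total} GB"
echo "average: $(gib_per_day "$total" 12000) GB per day of power-on time"
```

With these invented inputs the script reports 5000 GB written in total and an average of 10 GB per day. Comparing that average against the endurance rating in your drive's datasheet gives a rough life expectancy.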

Other Ideas

For desktop systems, it may be beneficial to move the browser cache to /tmp if it is mounted in RAM (the only notable drawbacks being that it is volatile and shared between user accounts). You may also want to run iotop to see which processes write to disk the most.

Here is another good candidate to be added to the /etc/rc.d/rc.local file as well (to be executed at each boot):

# disable the NMI watchdog (hard lockup detector) - reduces wakeups
echo 0 > /proc/sys/kernel/nmi_watchdog

Windows users may want to check out our Windows SSD Speed Tweaks article as well.
