Home Home > GIT Browse
diff options
authorFilipe Manana <fdmanana@suse.com>2018-10-15 17:49:58 +0100
committerFilipe Manana <fdmanana@suse.com>2018-10-15 17:49:58 +0100
commit944429a5cae08d82cedd5e6df2e445054316839a (patch)
parentbff757c820ce5d265e7f8c392f1877bdbe234679 (diff)
Btrfs: fix file data corruption after cloning a range and fsync
2 files changed, 108 insertions, 0 deletions
diff --git a/patches.suse/btrfs-fix-file-data-corruption-after-cloning-a-range.patch b/patches.suse/btrfs-fix-file-data-corruption-after-cloning-a-range.patch
new file mode 100644
index 0000000000..bf0b7e8c62
--- /dev/null
+++ b/patches.suse/btrfs-fix-file-data-corruption-after-cloning-a-range.patch
@@ -0,0 +1,107 @@
+From: Filipe Manana <fdmanana@suse.com>
+Date: Thu, 12 Jul 2018 01:36:43 +0100
+Patch-mainline: 4.18
+Git-commit: bd3599a0e142cd73edd3b6801068ac3f48ac771a
+Subject: [PATCH] Btrfs: fix file data corruption after cloning a range and
+ fsync
+References: bsc#1111901
+When we clone a range into a file we can end up dropping existing
+extent maps (or trimming them) and replacing them with new ones if the
+range to be cloned overlaps with a range in the destination inode.
+When that happens we add the new extent maps to the list of modified
+extents in the inode's extent map tree, so that a "fast" fsync (the flag
+BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
+and log corresponding extent items. However, at the end of range cloning
+operation we do truncate all the pages in the affected range (in order to
+ensure future reads will not get stale data). Sometimes this truncation
+will release the corresponding extent maps besides the pages from the page
+cache. If this happens, then a "fast" fsync operation will miss logging
+some extent items, because it relies exclusively on the extent maps being
+present in the inode's extent tree, leading to data loss/corruption if
+the fsync ends up using the same transaction used by the clone operation
+(that transaction was not committed in the meanwhile). An extent map is
+released through the callback btrfs_invalidatepage(), which gets called by
+truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
+later ends up calling try_release_extent_mapping() which will release the
+extent map if some conditions are met, like the file size being greater
+than 16Mb, gfp flags allow blocking and the range not being locked (which
+is the case during the clone operation) nor being the extent map flagged
+as pinned (also the case for cloning).
+The following example, turned into a test for fstests, reproduces the
+ $ mkfs.btrfs -f /dev/sdb
+ $ mount /dev/sdb /mnt
+ $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
+ $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
+ $ xfs_io -c "fsync" /mnt/bar
+ # reflink destination offset corresponds to the size of file bar,
+ # 2728Kb minus 4Kb.
+ $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
+ $ xfs_io -c "fsync" /mnt/bar
+ $ md5sum /mnt/bar
+ 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar
+ <power fail>
+ $ mount /dev/sdb /mnt
+ $ md5sum /mnt/bar
+ 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
+ # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
+ # power failure
+In the above example, the destination offset of the clone operation
+corresponds to the size of the "bar" file minus 4Kb. So during the clone
+operation, the extent map covering the range from 2572Kb to 2728Kb gets
+trimmed so that it ends at offset 2724Kb, and a new extent map covering
+the range from 2724Kb to 11724Kb is created. So at the end of the clone
+operation when we ask to truncate the pages in the range from 2724Kb to
+2724Kb + 15908Kb, the page invalidation callback ends up removing the new
+extent map (through try_release_extent_mapping()) when the page at offset
+2724Kb is passed to that callback.
+Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
+map is removed at try_release_extent_mapping(), forcing the next fsync to
+search for modified extents in the fs/subvolume tree instead of relying on
+the presence of extent maps in memory. This way we can continue doing a
+"fast" fsync if the destination range of a clone operation does not
+overlap with an existing range or if any of the criteria necessary to
+remove an extent map at try_release_extent_mapping() is not met (file
+size not bigger then 16Mb or gfp flags do not allow blocking).
+CC: stable@vger.kernel.org # 3.16+
+Signed-off-by: Filipe Manana <fdmanana@suse.com>
+Signed-off-by: David Sterba <dsterba@suse.com>
+ fs/btrfs/extent_io.c | 3 +++
+ 1 file changed, 3 insertions(+)
+diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
+index 062b27621e67..1e46d50373ae 100644
+--- a/fs/btrfs/extent_io.c
++++ b/fs/btrfs/extent_io.c
+@@ -4324,6 +4324,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
+ struct extent_map *em;
+ u64 start = page_offset(page);
+ u64 end = start + PAGE_SIZE - 1;
++ struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host);
+ if (gfpflags_allow_blocking(mask) &&
+ page->mapping->host->i_size > SZ_16M) {
+@@ -4346,6 +4347,8 @@ int try_release_extent_mapping(struct extent_map_tree *map,
+ extent_map_end(em) - 1,
+ 0, NULL)) {
++ &btrfs_inode->runtime_flags);
+ remove_extent_mapping(map, em);
+ /* once for the rb tree */
+ free_extent_map(em);
diff --git a/series.conf b/series.conf
index 3357a468c2..cacf16d607 100644
--- a/series.conf
+++ b/series.conf
@@ -16890,6 +16890,7 @@
+ patches.suse/btrfs-fix-file-data-corruption-after-cloning-a-range.patch