[Midnightbsd-cvs] src [7820] trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs: Illumos 13879:4eac7a87eff2

laffer1 at midnightbsd.org laffer1 at midnightbsd.org
Tue Sep 6 19:50:04 EDT 2016


Revision: 7820
          http://svnweb.midnightbsd.org/src/?rev=7820
Author:   laffer1
Date:     2016-09-06 19:50:03 -0400 (Tue, 06 Sep 2016)
Log Message:
-----------
Illumos 13879:4eac7a87eff2
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

Add loader(8) tunable to enable/disable nopwrite functionality:
vfs.zfs.nopwrite_enabled

Introduce a new dataset aclmode setting "restricted" to protect ACL's
being destroyed or corrupted by a drive-by chmod.

illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted

Update manpage dates in zfs.8 and zpool.8

Modified Paths:
--------------
    trunk/cddl/contrib/opensolaris/cmd/zfs/zfs.8
    trunk/cddl/contrib/opensolaris/cmd/zpool/zpool.8
    trunk/cddl/contrib/opensolaris/cmd/ztest/ztest.c
    trunk/sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/space_map.h
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_impl.h
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
    trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c

Property Changed:
----------------
    trunk/cddl/contrib/opensolaris/cmd/zfs/zfs.8
    trunk/cddl/contrib/opensolaris/cmd/zpool/zpool.8
    trunk/cddl/contrib/opensolaris/cmd/ztest/ztest.c

Modified: trunk/cddl/contrib/opensolaris/cmd/zfs/zfs.8
===================================================================
--- trunk/cddl/contrib/opensolaris/cmd/zfs/zfs.8	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/cddl/contrib/opensolaris/cmd/zfs/zfs.8	2016-09-06 23:50:03 UTC (rev 7820)
@@ -25,7 +25,7 @@
 .\"
 .\" $MidnightBSD$
 .\"
-.Dd November 26, 2011
+.Dd November 26, 2012
 .Dt ZFS 8
 .Os
 .Sh NAME
@@ -748,7 +748,7 @@
 .Tn ACE Ns s
 exist that affect the mode, then the mode is set in accordance to the requested
 mode from the application.
-.It Sy aclmode Ns = Ns Cm discard | groupmask | passthrough
+.It Sy aclmode Ns = Ns Cm discard | groupmask | passthrough | restricted
 Controls how an
 .Tn ACL
 is modified during
@@ -778,6 +778,32 @@
 other than creating or updating the necessary
 .Tn ACL
 entries to represent the new mode of the file or directory.
+An
+.Sy aclmode
+property of
+.Cm restricted
+will cause the
+.Xr chmod 2
+operation to return an error when used on any file or directory which has
+a non-trivial
+.Tn ACL
+whose entries can not be represented by a mode.
+.Xr chmod 2
+is required to change the set user ID, set group ID, or sticky bits on a file
+or directory, as they do not have equivalent
+.Tn ACL
+entries.
+In order to use
+.Xr chmod 2
+on a file or directory with a non-trivial
+.Tn ACL
+when
+.Sy aclmode
+is set to
+.Cm restricted ,
+you must first remove all
+.Tn ACL
+entries which do not represent the current mode.
 .It Sy atime Ns = Ns Cm on | off
 Controls whether the access time for files is updated when they are read.
 Turning this property off avoids producing write traffic when reading files and


Property changes on: trunk/cddl/contrib/opensolaris/cmd/zfs/zfs.8
___________________________________________________________________
Deleted: cvs2svn:cvs-rev
## -1 +0,0 ##
-1.2
\ No newline at end of property
Modified: trunk/cddl/contrib/opensolaris/cmd/zpool/zpool.8
===================================================================
--- trunk/cddl/contrib/opensolaris/cmd/zpool/zpool.8	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/cddl/contrib/opensolaris/cmd/zpool/zpool.8	2016-09-06 23:50:03 UTC (rev 7820)
@@ -23,7 +23,7 @@
 .\"
 .\" $MidnightBSD$
 .\"
-.Dd November 28, 2011
+.Dd November 15, 2012
 .Dt ZPOOL 8
 .Os
 .Sh NAME


Property changes on: trunk/cddl/contrib/opensolaris/cmd/zpool/zpool.8
___________________________________________________________________
Deleted: cvs2svn:cvs-rev
## -1 +0,0 ##
-1.2
\ No newline at end of property
Modified: trunk/cddl/contrib/opensolaris/cmd/ztest/ztest.c
===================================================================
--- trunk/cddl/contrib/opensolaris/cmd/ztest/ztest.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/cddl/contrib/opensolaris/cmd/ztest/ztest.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -203,6 +203,7 @@
 	ZTEST_IO_WRITE_ZEROES,
 	ZTEST_IO_TRUNCATE,
 	ZTEST_IO_SETATTR,
+	ZTEST_IO_REWRITE,
 	ZTEST_IO_TYPES
 };
 
@@ -1825,6 +1826,12 @@
 		    DMU_READ_NO_PREFETCH);
 
 		if (error == 0) {
+			blkptr_t *obp = dmu_buf_get_blkptr(db);
+			if (obp) {
+				ASSERT(BP_IS_HOLE(bp));
+				*bp = *obp;
+			}
+
 			zgd->zgd_db = db;
 			zgd->zgd_bp = bp;
 
@@ -1970,6 +1977,9 @@
 			continue;
 		}
 
+		/*
+		 * No object was found.
+		 */
 		if (od->od_object == 0)
 			continue;
 
@@ -2085,6 +2095,7 @@
 static void
 ztest_io(ztest_ds_t *zd, uint64_t object, uint64_t offset)
 {
+	int err;
 	ztest_block_tag_t wbt;
 	dmu_object_info_t doi;
 	enum ztest_io_type io_type;
@@ -2137,6 +2148,25 @@
 	case ZTEST_IO_SETATTR:
 		(void) ztest_setattr(zd, object);
 		break;
+
+	case ZTEST_IO_REWRITE:
+		(void) rw_rdlock(&ztest_name_lock);
+		err = ztest_dsl_prop_set_uint64(zd->zd_name,
+		    ZFS_PROP_CHECKSUM, spa_dedup_checksum(ztest_spa),
+		    B_FALSE);
+		VERIFY(err == 0 || err == ENOSPC);
+		err = ztest_dsl_prop_set_uint64(zd->zd_name,
+		    ZFS_PROP_COMPRESSION,
+		    ztest_random_dsl_prop(ZFS_PROP_COMPRESSION),
+		    B_FALSE);
+		VERIFY(err == 0 || err == ENOSPC);
+		(void) rw_unlock(&ztest_name_lock);
+
+		VERIFY0(dmu_read(zd->zd_os, object, offset, blocksize, data,
+		    DMU_READ_NO_PREFETCH));
+
+		(void) ztest_write(zd, object, offset, blocksize, data);
+		break;
 	}
 
 	(void) rw_unlock(&zd->zd_zilog_lock);
@@ -2224,6 +2254,12 @@
 {
 	objset_t *os = zd->zd_os;
 
+	/*
+	 * We grab the zd_dirobj_lock to ensure that no other thread is
+	 * updating the zil (i.e. adding in-memory log records) and the
+	 * zd_zilog_lock to block any I/O.
+ 	 */
+	VERIFY0(mutex_lock(&zd->zd_dirobj_lock));
 	(void) rw_wrlock(&zd->zd_zilog_lock);
 
 	/* zfsvfs_teardown() */
@@ -4798,8 +4834,8 @@
 	 */
 	for (int i = 0; i < copies; i++) {
 		uint64_t offset = i * blocksize;
-		VERIFY(dmu_buf_hold(os, object, offset, FTAG, &db,
-		    DMU_READ_NO_PREFETCH) == 0);
+		VERIFY0(dmu_buf_hold(os, object, offset, FTAG, &db,
+		    DMU_READ_NO_PREFETCH));
 		ASSERT(db->db_offset == offset);
 		ASSERT(db->db_size == blocksize);
 		ASSERT(ztest_pattern_match(db->db_data, db->db_size, pattern) ||
@@ -4815,8 +4851,8 @@
 	/*
 	 * Find out what block we got.
 	 */
-	VERIFY(dmu_buf_hold(os, object, 0, FTAG, &db,
-	    DMU_READ_NO_PREFETCH) == 0);
+	VERIFY0(dmu_buf_hold(os, object, 0, FTAG, &db,
+	    DMU_READ_NO_PREFETCH));
 	blk = *((dmu_buf_impl_t *)db)->db_blkptr;
 	dmu_buf_rele(db, FTAG);
 
@@ -5488,6 +5524,8 @@
 	kernel_init(FREAD | FWRITE);
 	VERIFY3U(0, ==, spa_open(ztest_opts.zo_pool, &spa, FTAG));
 	VERIFY3U(0, ==, ztest_dataset_open(0));
+	spa->spa_debug = B_TRUE;
+	ztest_spa = spa;
 
 	/*
 	 * Force the first log block to be transactionally allocated.


Property changes on: trunk/cddl/contrib/opensolaris/cmd/ztest/ztest.c
___________________________________________________________________
Deleted: cvs2svn:cvs-rev
## -1 +0,0 ##
-1.2
\ No newline at end of property
Modified: trunk/sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/common/zfs/zfs_prop.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -111,6 +111,7 @@
 		{ "discard",	ZFS_ACL_DISCARD },
 		{ "groupmask",	ZFS_ACL_GROUPMASK },
 		{ "passthrough", ZFS_ACL_PASSTHROUGH },
+		{ "restricted", ZFS_ACL_RESTRICTED },
 		{ NULL }
 	};
 
@@ -219,7 +220,8 @@
 	    "hidden | visible", "SNAPDIR", snapdir_table);
 	zprop_register_index(ZFS_PROP_ACLMODE, "aclmode", ZFS_ACL_DISCARD,
 	    PROP_INHERIT, ZFS_TYPE_FILESYSTEM,
-	    "discard | groupmask | passthrough", "ACLMODE", acl_mode_table);
+	    "discard | groupmask | passthrough | restricted", "ACLMODE",
+	    acl_mode_table);
 	zprop_register_index(ZFS_PROP_ACLINHERIT, "aclinherit",
 	    ZFS_ACL_RESTRICTED, PROP_INHERIT, ZFS_TYPE_FILESYSTEM,
 	    "discard | noallow | restricted | passthrough | passthrough-x",

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -3537,6 +3537,12 @@
 				arc_hdr_destroy(exists);
 				exists = buf_hash_insert(hdr, &hash_lock);
 				ASSERT3P(exists, ==, NULL);
+			} else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
+				/* nopwrite */
+				ASSERT(zio->io_prop.zp_nopwrite);
+				if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))
+					panic("bad nopwrite, hdr=%p exists=%p",
+					    (void *)hdr, (void *)exists);
 			} else {
 				/* Dedup */
 				ASSERT(hdr->b_datacnt == 1);

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -767,7 +767,7 @@
 	ASSERT(db->db_data_pending != dr);
 
 	/* free this block */
-	if (!BP_IS_HOLE(bp)) {
+	if (!BP_IS_HOLE(bp) && !dr->dt.dl.dr_nopwrite) {
 		spa_t *spa;
 
 		DB_GET_SPA(&spa, db);
@@ -774,6 +774,8 @@
 		zio_free(spa, txg, bp);
 	}
 	dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
+	dr->dt.dl.dr_nopwrite = B_FALSE;
+
 	/*
 	 * Release the already-written buffer, so we leave it in
 	 * a consistent dirty state.  Note that all callers are
@@ -2171,6 +2173,13 @@
 	return (res);
 }
 
+blkptr_t *
+dmu_buf_get_blkptr(dmu_buf_t *db)
+{
+	dmu_buf_impl_t *dbi = (dmu_buf_impl_t *)db;
+	return (dbi->db_blkptr);
+}
+
 static void
 dbuf_check_blkptr(dnode_t *dn, dmu_buf_impl_t *db)
 {
@@ -2513,7 +2522,11 @@
 	ASSERT3U(zio->io_error, ==, 0);
 	ASSERT(db->db_blkptr == bp);
 
-	if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
+	/*
+	 * For nopwrites and rewrites we ensure that the bp matches our
+	 * original and bypass all the accounting.
+	 */
+	if (zio->io_flags & (ZIO_FLAG_IO_REWRITE | ZIO_FLAG_NOPWRITE)) {
 		ASSERT(BP_EQUAL(bp, bp_orig));
 	} else {
 		objset_t *os;
@@ -2704,7 +2717,7 @@
 		mutex_enter(&db->db_mtx);
 		dr->dt.dl.dr_override_state = DR_NOT_OVERRIDDEN;
 		zio_write_override(dr->dr_zio, &dr->dt.dl.dr_overridden_by,
-		    dr->dt.dl.dr_copies);
+		    dr->dt.dl.dr_copies, dr->dt.dl.dr_nopwrite);
 		mutex_exit(&db->db_mtx);
 	} else if (db->db_state == DB_NOFILL) {
 		ASSERT(zp.zp_checksum == ZIO_CHECKSUM_OFF);

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -39,11 +39,21 @@
 #include <sys/zfs_ioctl.h>
 #include <sys/zap.h>
 #include <sys/zio_checksum.h>
+#include <sys/zio_compress.h>
 #include <sys/sa.h>
 #ifdef _KERNEL
 #include <sys/zfs_znode.h>
 #endif
 
+/*
+ * Enable/disable nopwrite feature.
+ */
+int zfs_nopwrite_enabled = 1;
+SYSCTL_DECL(_vfs_zfs);
+TUNABLE_INT("vfs.zfs.nopwrite_enabled", &zfs_nopwrite_enabled);
+SYSCTL_INT(_vfs_zfs, OID_AUTO, nopwrite_enabled, CTLFLAG_RDTUN,
+    &zfs_nopwrite_enabled, 0, "Enable nopwrite feature");
+
 const dmu_object_type_info_t dmu_ot[DMU_OT_NUMTYPES] = {
 	{	byteswap_uint8_array,	TRUE,	"unallocated"		},
 	{	zap_byteswap,		TRUE,	"object directory"	},
@@ -1273,6 +1283,16 @@
 	mutex_enter(&db->db_mtx);
 	ASSERT(dr->dt.dl.dr_override_state == DR_IN_DMU_SYNC);
 	if (zio->io_error == 0) {
+		dr->dt.dl.dr_nopwrite = !!(zio->io_flags & ZIO_FLAG_NOPWRITE);
+		if (dr->dt.dl.dr_nopwrite) {
+			blkptr_t *bp = zio->io_bp;
+			blkptr_t *bp_orig = &zio->io_bp_orig;
+			uint8_t chksum = BP_GET_CHECKSUM(bp_orig);
+
+			ASSERT(BP_EQUAL(bp, bp_orig));
+			ASSERT(zio->io_prop.zp_compress != ZIO_COMPRESS_OFF);
+			ASSERT(zio_checksum_table[chksum].ci_dedup);
+		}
 		dr->dt.dl.dr_overridden_by = *zio->io_bp;
 		dr->dt.dl.dr_override_state = DR_OVERRIDDEN;
 		dr->dt.dl.dr_copies = zio->io_prop.zp_copies;
@@ -1294,11 +1314,22 @@
 {
 	blkptr_t *bp = zio->io_bp;
 	dmu_sync_arg_t *dsa = zio->io_private;
+	blkptr_t *bp_orig = &zio->io_bp_orig;
 
 	if (zio->io_error == 0 && !BP_IS_HOLE(bp)) {
-		ASSERT(zio->io_bp->blk_birth == zio->io_txg);
-		ASSERT(zio->io_txg > spa_syncing_txg(zio->io_spa));
-		zio_free(zio->io_spa, zio->io_txg, zio->io_bp);
+		/*
+		 * If we didn't allocate a new block (i.e. ZIO_FLAG_NOPWRITE)
+		 * then there is nothing to do here. Otherwise, free the
+		 * newly allocated block in this txg.
+		 */
+		if (zio->io_flags & ZIO_FLAG_NOPWRITE) {
+			ASSERT(BP_EQUAL(bp, bp_orig));
+		} else {
+			ASSERT(BP_IS_HOLE(bp_orig) || !BP_EQUAL(bp, bp_orig));
+			ASSERT(zio->io_bp->blk_birth == zio->io_txg);
+			ASSERT(zio->io_txg > spa_syncing_txg(zio->io_spa));
+			zio_free(zio->io_spa, zio->io_txg, zio->io_bp);
+		}
 	}
 
 	dmu_tx_commit(dsa->dsa_tx);
@@ -1343,7 +1374,7 @@
  *
  * Return values:
  *
- *	EEXIST: this txg has already been synced, so there's nothing to to.
+ *	EEXIST: this txg has already been synced, so there's nothing to do.
  *		The caller should not log the write.
  *
  *	ENOENT: the block was dbuf_free_range()'d, so there's nothing to do.
@@ -1375,7 +1406,6 @@
 	dnode_t *dn;
 
 	ASSERT(pio != NULL);
-	ASSERT(BP_IS_HOLE(bp));
 	ASSERT(txg != 0);
 
 	SET_BOOKMARK(&zb, ds->ds_object,
@@ -1430,6 +1460,23 @@
 		return (ENOENT);
 	}
 
+	ASSERT(dr->dr_next == NULL || dr->dr_next->dr_txg < txg);
+
+	/*
+	 * Assume the on-disk data is X, the current syncing data is Y,
+	 * and the current in-memory data is Z (currently in dmu_sync).
+	 * X and Z are identical but Y is has been modified. Normally,
+	 * when X and Z are the same we will perform a nopwrite but if Y
+	 * is different we must disable nopwrite since the resulting write
+	 * of Y to disk can free the block containing X. If we allowed a
+	 * nopwrite to occur the block pointing to Z would reference a freed
+	 * block. Since this is a rare case we simplify this by disabling
+	 * nopwrite if the current dmu_sync-ing dbuf has been modified in
+	 * a previous transaction.
+	 */
+	if (dr->dr_next)
+		zp.zp_nopwrite = B_FALSE;
+
 	ASSERT(dr->dr_txg == txg);
 	if (dr->dt.dl.dr_override_state == DR_IN_DMU_SYNC ||
 	    dr->dt.dl.dr_override_state == DR_OVERRIDDEN) {
@@ -1505,7 +1552,6 @@
 
 int zfs_mdcomp_disable = 0;
 TUNABLE_INT("vfs.zfs.mdcomp_disable", &zfs_mdcomp_disable);
-SYSCTL_DECL(_vfs_zfs);
 SYSCTL_INT(_vfs_zfs, OID_AUTO, mdcomp_disable, CTLFLAG_RW,
     &zfs_mdcomp_disable, 0, "Disable metadata compression");
 
@@ -1518,15 +1564,27 @@
 	enum zio_checksum checksum = os->os_checksum;
 	enum zio_compress compress = os->os_compress;
 	enum zio_checksum dedup_checksum = os->os_dedup_checksum;
-	boolean_t dedup;
+	boolean_t dedup = B_FALSE;
+	boolean_t nopwrite = B_FALSE;
 	boolean_t dedup_verify = os->os_dedup_verify;
 	int copies = os->os_copies;
 
 	/*
-	 * Determine checksum setting.
+	 * We maintain different write policies for each of the following
+	 * types of data:
+	 *	 1. metadata
+	 *	 2. preallocated blocks (i.e. level-0 blocks of a dump device)
+	 *	 3. all other level 0 blocks
 	 */
 	if (ismd) {
 		/*
+		 * XXX -- we should design a compression algorithm
+		 * that specializes in arrays of bps.
+		 */
+		compress = zfs_mdcomp_disable ? ZIO_COMPRESS_EMPTY :
+		    ZIO_COMPRESS_LZJB;
+
+		/*
 		 * Metadata always gets checksummed.  If the data
 		 * checksum is multi-bit correctable, and it's not a
 		 * ZBT-style checksum, then it's suitable for metadata
@@ -1536,45 +1594,47 @@
 		if (zio_checksum_table[checksum].ci_correctable < 1 ||
 		    zio_checksum_table[checksum].ci_eck)
 			checksum = ZIO_CHECKSUM_FLETCHER_4;
-	} else {
-		checksum = zio_checksum_select(dn->dn_checksum, checksum);
-	}
+	} else if (wp & WP_NOFILL) {
+		ASSERT(level == 0);
 
-	/*
-	 * Determine compression setting.
-	 */
-	if (ismd) {
 		/*
-		 * XXX -- we should design a compression algorithm
-		 * that specializes in arrays of bps.
+		 * If we're writing preallocated blocks, we aren't actually
+		 * writing them so don't set any policy properties.  These
+		 * blocks are currently only used by an external subsystem
+		 * outside of zfs (i.e. dump) and not written by the zio
+		 * pipeline.
 		 */
-		compress = zfs_mdcomp_disable ? ZIO_COMPRESS_EMPTY :
-		    ZIO_COMPRESS_LZJB;
+		compress = ZIO_COMPRESS_OFF;
+		checksum = ZIO_CHECKSUM_OFF;
 	} else {
 		compress = zio_compress_select(dn->dn_compress, compress);
-	}
 
-	/*
-	 * Determine dedup setting.  If we are in dmu_sync(), we won't
-	 * actually dedup now because that's all done in syncing context;
-	 * but we do want to use the dedup checkum.  If the checksum is not
-	 * strong enough to ensure unique signatures, force dedup_verify.
-	 */
-	dedup = (!ismd && dedup_checksum != ZIO_CHECKSUM_OFF);
-	if (dedup) {
-		checksum = dedup_checksum;
-		if (!zio_checksum_table[checksum].ci_dedup)
-			dedup_verify = 1;
-	}
+		checksum = (dedup_checksum == ZIO_CHECKSUM_OFF) ?
+		    zio_checksum_select(dn->dn_checksum, checksum) :
+		    dedup_checksum;
 
-	if (wp & WP_DMU_SYNC)
-		dedup = 0;
+		/*
+		 * Determine dedup setting.  If we are in dmu_sync(),
+		 * we won't actually dedup now because that's all
+		 * done in syncing context; but we do want to use the
+		 * dedup checkum.  If the checksum is not strong
+		 * enough to ensure unique signatures, force
+		 * dedup_verify.
+		 */
+		if (dedup_checksum != ZIO_CHECKSUM_OFF) {
+			dedup = (wp & WP_DMU_SYNC) ? B_FALSE : B_TRUE;
+			if (!zio_checksum_table[checksum].ci_dedup)
+				dedup_verify = B_TRUE;
+		}
 
-	if (wp & WP_NOFILL) {
-		ASSERT(!ismd && level == 0);
-		checksum = ZIO_CHECKSUM_OFF;
-		compress = ZIO_COMPRESS_OFF;
-		dedup = B_FALSE;
+		/*
+		 * Enable nopwrite if we have a cryptographically secure
+		 * checksum that has no known collisions (i.e. SHA-256)
+		 * and compression is enabled.  We don't enable nopwrite if
+		 * dedup is enabled as the two features are mutually exclusive.
+		 */
+		nopwrite = (!dedup && zio_checksum_table[checksum].ci_dedup &&
+		    compress != ZIO_COMPRESS_OFF && zfs_nopwrite_enabled);
 	}
 
 	zp->zp_checksum = checksum;
@@ -1584,6 +1644,7 @@
 	zp->zp_copies = MIN(copies + ismd, spa_max_replication(os->os_spa));
 	zp->zp_dedup = dedup;
 	zp->zp_dedup_verify = dedup && dedup_verify;
+	zp->zp_nopwrite = nopwrite;
 }
 
 int

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -881,8 +881,9 @@
 	if ((msp->ms_weight & METASLAB_ACTIVE_MASK) == 0) {
 		space_map_load_wait(sm);
 		if (!sm->sm_loaded) {
-			int error = space_map_load(sm, sm_ops, SM_FREE,
-			    &msp->ms_smo,
+			space_map_obj_t *smo = &msp->ms_smo;
+
+			int error = space_map_load(sm, sm_ops, SM_FREE, smo,
 			    spa_meta_objset(msp->ms_group->mg_vd->vdev_spa));
 			if (error)  {
 				metaslab_group_sort(msp->ms_group, msp, 0);

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -135,6 +135,7 @@
 uint_t		zio_taskq_basedc = 80;		/* base duty cycle */
 
 boolean_t	spa_create_process = B_TRUE;	/* no process ==> no sysdc */
+extern int	zfs_sync_pass_deferred_free;
 
 /*
  * This (illegal) pool name is used when temporarily importing a spa_t in order
@@ -5714,7 +5715,7 @@
 		spa_errlog_sync(spa, txg);
 		dsl_pool_sync(dp, txg);
 
-		if (pass <= SYNC_PASS_DEFERRED_FREE) {
+		if (pass < zfs_sync_pass_deferred_free) {
 			zio_t *zio = zio_root(spa, NULL, NULL, 0);
 			bplist_iterate(free_bpl, spa_free_sync_cb,
 			    zio, tx);

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_misc.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -1547,6 +1547,7 @@
 
 	refcount_sysinit();
 	unique_init();
+	space_map_init();
 	zio_init();
 	dmu_init();
 	zil_init();
@@ -1568,6 +1569,7 @@
 	zil_fini();
 	dmu_fini();
 	zio_fini();
+	space_map_fini();
 	unique_fini();
 	refcount_fini();
 

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/space_map.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -29,6 +29,23 @@
 #include <sys/zio.h>
 #include <sys/space_map.h>
 
+static kmem_cache_t *space_seg_cache;
+
+void
+space_map_init(void)
+{
+	ASSERT(space_seg_cache == NULL);
+	space_seg_cache = kmem_cache_create("space_seg_cache",
+	    sizeof (space_seg_t), 0, NULL, NULL, NULL, NULL, NULL, 0);
+}
+
+void
+space_map_fini(void)
+{
+	kmem_cache_destroy(space_seg_cache);
+	space_seg_cache = NULL;
+}
+
 /*
  * Space map routines.
  * NOTE: caller is responsible for all locking.
@@ -121,7 +138,7 @@
 			avl_remove(sm->sm_pp_root, ss_after);
 		}
 		ss_after->ss_start = ss_before->ss_start;
-		kmem_free(ss_before, sizeof (*ss_before));
+		kmem_cache_free(space_seg_cache, ss_before);
 		ss = ss_after;
 	} else if (merge_before) {
 		ss_before->ss_end = end;
@@ -134,7 +151,7 @@
 			avl_remove(sm->sm_pp_root, ss_after);
 		ss = ss_after;
 	} else {
-		ss = kmem_alloc(sizeof (*ss), KM_SLEEP);
+		ss = kmem_cache_alloc(space_seg_cache, KM_SLEEP);
 		ss->ss_start = start;
 		ss->ss_end = end;
 		avl_insert(&sm->sm_root, ss, where);
@@ -181,7 +198,7 @@
 		avl_remove(sm->sm_pp_root, ss);
 
 	if (left_over && right_over) {
-		newseg = kmem_alloc(sizeof (*newseg), KM_SLEEP);
+		newseg = kmem_cache_alloc(space_seg_cache, KM_SLEEP);
 		newseg->ss_start = end;
 		newseg->ss_end = ss->ss_end;
 		ss->ss_end = start;
@@ -194,7 +211,7 @@
 		ss->ss_start = end;
 	} else {
 		avl_remove(&sm->sm_root, ss);
-		kmem_free(ss, sizeof (*ss));
+		kmem_cache_free(space_seg_cache, ss);
 		ss = NULL;
 	}
 
@@ -234,7 +251,7 @@
 	while ((ss = avl_destroy_nodes(&sm->sm_root, &cookie)) != NULL) {
 		if (func != NULL)
 			func(mdest, ss->ss_start, ss->ss_end - ss->ss_start);
-		kmem_free(ss, sizeof (*ss));
+		kmem_cache_free(space_seg_cache, ss);
 	}
 	sm->sm_space = 0;
 }
@@ -406,7 +423,7 @@
 	spa_t *spa = dmu_objset_spa(os);
 	void *cookie = NULL;
 	space_seg_t *ss;
-	uint64_t bufsize, start, size, run_len;
+	uint64_t bufsize, start, size, run_len, delta, sm_space;
 	uint64_t *entry, *entry_map, *entry_map_end;
 
 	ASSERT(MUTEX_HELD(sm->sm_lock));
@@ -435,11 +452,13 @@
 	    SM_DEBUG_SYNCPASS_ENCODE(spa_sync_pass(spa)) |
 	    SM_DEBUG_TXG_ENCODE(dmu_tx_get_txg(tx));
 
+	delta = 0;
+	sm_space = sm->sm_space;
 	while ((ss = avl_destroy_nodes(&sm->sm_root, &cookie)) != NULL) {
 		size = ss->ss_end - ss->ss_start;
 		start = (ss->ss_start - sm->sm_start) >> sm->sm_shift;
 
-		sm->sm_space -= size;
+		delta += size;
 		size >>= sm->sm_shift;
 
 		while (size) {
@@ -461,7 +480,7 @@
 			start += run_len;
 			size -= run_len;
 		}
-		kmem_free(ss, sizeof (*ss));
+		kmem_cache_free(space_seg_cache, ss);
 	}
 
 	if (entry != entry_map) {
@@ -473,8 +492,15 @@
 		smo->smo_objsize += size;
 	}
 
+	/*
+	 * Ensure that the space_map's accounting wasn't changed
+	 * while we were in the middle of writing it out.
+	 */
+	VERIFY3U(sm->sm_space, ==, sm_space);
+
 	zio_buf_free(entry_map, bufsize);
 
+	sm->sm_space -= delta;
 	VERIFY3U(sm->sm_space, ==, 0);
 }
 

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h	2016-09-06 23:50:03 UTC (rev 7820)
@@ -20,6 +20,7 @@
  */
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2012 by Delphix. All rights reserved.
  */
 
 #ifndef	_SYS_DBUF_H
@@ -130,6 +131,7 @@
 			blkptr_t dr_overridden_by;
 			override_states_t dr_override_state;
 			uint8_t dr_copies;
+			boolean_t dr_nopwrite;
 		} dl;
 	} dt;
 } dbuf_dirty_record_t;

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h	2016-09-06 23:50:03 UTC (rev 7820)
@@ -426,6 +426,11 @@
 void *dmu_buf_get_user(dmu_buf_t *db);
 
 /*
+ * Returns the blkptr associated with this dbuf, or NULL if not set.
+ */
+struct blkptr *dmu_buf_get_blkptr(dmu_buf_t *db);
+
+/*
  * Indicate that you are going to modify the buffer's data (db_data).
  *
  * The transaction (tx) must be assigned to a txg (ie. you've called

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h	2016-09-06 23:50:03 UTC (rev 7820)
@@ -21,9 +21,12 @@
 /*
  * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
  * Use is subject to license terms.
- * Copyright (c) 2011 by Delphix. All rights reserved.
  */
 
+/*
+ * Copyright (c) 2012 by Delphix. All rights reserved.
+ */
+
 #ifndef _SYS_METASLAB_IMPL_H
 #define	_SYS_METASLAB_IMPL_H
 

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h	2016-09-06 23:50:03 UTC (rev 7820)
@@ -485,14 +485,6 @@
 extern void spa_sync(spa_t *spa, uint64_t txg); /* only for DMU use */
 extern void spa_sync_allpools(void);
 
-/*
- * DEFERRED_FREE must be large enough that regular blocks are not
- * deferred.  XXX so can't we change it back to 1?
- */
-#define	SYNC_PASS_DEFERRED_FREE	2	/* defer frees after this pass */
-#define	SYNC_PASS_DONT_COMPRESS	4	/* don't compress after this pass */
-#define	SYNC_PASS_REWRITE	1	/* rewrite new bps after this pass */
-
 /* spa namespace global mutex */
 extern kmutex_t spa_namespace_lock;
 

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/space_map.h
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/space_map.h	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/space_map.h	2016-09-06 23:50:03 UTC (rev 7820)
@@ -23,6 +23,10 @@
  * Use is subject to license terms.
  */
 
+/*
+ * Copyright (c) 2012 by Delphix. All rights reserved.
+ */
+
 #ifndef _SYS_SPACE_MAP_H
 #define	_SYS_SPACE_MAP_H
 
@@ -136,6 +140,8 @@
 
 typedef void space_map_func_t(space_map_t *sm, uint64_t start, uint64_t size);
 
+extern void space_map_init(void);
+extern void space_map_fini(void);
 extern void space_map_create(space_map_t *sm, uint64_t start, uint64_t size,
     uint8_t shift, kmutex_t *lp);
 extern void space_map_destroy(space_map_t *sm);

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio.h	2016-09-06 23:50:03 UTC (rev 7820)
@@ -190,7 +190,9 @@
 	ZIO_FLAG_RAW		= 1 << 21,
 	ZIO_FLAG_GANG_CHILD	= 1 << 22,
 	ZIO_FLAG_DDT_CHILD	= 1 << 23,
-	ZIO_FLAG_GODFATHER	= 1 << 24
+	ZIO_FLAG_GODFATHER	= 1 << 24,
+	ZIO_FLAG_NOPWRITE	= 1 << 25,
+	ZIO_FLAG_REEXECUTED	= 1 << 26,
 };
 
 #define	ZIO_FLAG_MUSTSUCCEED		0
@@ -281,8 +283,9 @@
 	dmu_object_type_t	zp_type;
 	uint8_t			zp_level;
 	uint8_t			zp_copies;
-	uint8_t			zp_dedup;
-	uint8_t			zp_dedup_verify;
+	boolean_t		zp_dedup;
+	boolean_t		zp_dedup_verify;
+	boolean_t		zp_nopwrite;
 } zio_prop_t;
 
 typedef struct zio_cksum_report zio_cksum_report_t;
@@ -484,7 +487,8 @@
     void *data, uint64_t size, zio_done_func_t *done, void *priv,
     int priority, enum zio_flag flags, zbookmark_t *zb);
 
-extern void zio_write_override(zio_t *zio, blkptr_t *bp, int copies);
+extern void zio_write_override(zio_t *zio, blkptr_t *bp, int copies,
+    boolean_t nopwrite);
 
 extern void zio_free(spa_t *spa, uint64_t txg, const blkptr_t *bp);
 

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_impl.h
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_impl.h	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_impl.h	2016-09-06 23:50:03 UTC (rev 7820)
@@ -23,6 +23,10 @@
  * Use is subject to license terms.
  */
 
+/*
+ * Copyright (c) 2012 by Delphix. All rights reserved.
+ */
+
 #ifndef _ZIO_IMPL_H
 #define	_ZIO_IMPL_H
 
@@ -34,6 +38,70 @@
 #endif
 
 /*
+ * XXX -- Describe ZFS I/O pipleine here. Fill in as needed.
+ *
+ * The ZFS I/O pipeline is comprised of various stages which are defined
+ * in the zio_stage enum below. The individual stages are used to construct
+ * these basic I/O operations: Read, Write, Free, Claim, and Ioctl.
+ *
+ * I/O operations: (XXX - provide detail for each of the operations)
+ *
+ * Read:
+ * Write:
+ * Free:
+ * Claim:
+ * Ioctl:
+ *
+ * Although the most common pipeline are used by the basic I/O operations
+ * above, there are some helper pipelines (one could consider them
+ * sub-pipelines) which are used internally by the ZIO module and are
+ * explained below:
+ *
+ * Interlock Pipeline:
+ * The interlock pipeline is the most basic pipeline and is used by all
+ * of the I/O operations. The interlock pipeline does not perform any I/O
+ * and is used to coordinate the dependencies between I/Os that are being
+ * issued (i.e. the parent/child relationship).
+ *
+ * Vdev child Pipeline:
+ * The vdev child pipeline is responsible for performing the physical I/O.
+ * It is in this pipeline where the I/O are queued and possibly cached.
+ *
+ * In addition to performing I/O, the pipeline is also responsible for
+ * data transformations. The transformations performed are based on the
+ * specific properties that user may have selected and modify the
+ * behavior of the pipeline. Examples of supported transformations are
+ * compression, dedup, and nop writes. Transformations will either modify
+ * the data or the pipeline. This list below further describes each of
+ * the supported transformations:
+ *
+ * Compression:
+ * ZFS supports three different flavors of compression -- gzip, lzjb, and
+ * zle. Compression occurs as part of the write pipeline and is performed
+ * in the ZIO_STAGE_WRITE_BP_INIT stage.
+ *
+ * Dedup:
+ * Dedup reads are handled by the ZIO_STAGE_DDT_READ_START and
+ * ZIO_STAGE_DDT_READ_DONE stages. These stages are added to an existing
+ * read pipeline if the dedup bit is set on the block pointer.
+ * Writing a dedup block is performed by the ZIO_STAGE_DDT_WRITE stage
+ * and added to a write pipeline if a user has enabled dedup on that
+ * particular dataset.
+ *
+ * NOP Write:
+ * The NOP write feature is performed by the ZIO_STAGE_NOP_WRITE stage
+ * and is added to an existing write pipeline if a crypographically
+ * secure checksum (i.e. SHA256) is enabled and compression is turned on.
+ * The NOP write stage will compare the checksums of the current data
+ * on-disk (level-0 blocks only) and the data that is currently being written.
+ * If the checksum values are identical then the pipeline is converted to
+ * an interlock pipeline skipping block allocation and bypassing the
+ * physical I/O.  The nop write feature can handle writes in either
+ * syncing or open context (i.e. zil writes) and as a result is mutually
+ * exclusive with dedup.
+ */
+
+/*
  * zio pipeline stage definitions
  */
 enum zio_stage {

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -1205,6 +1205,12 @@
 			    DMU_READ_NO_PREFETCH);
 
 		if (error == 0) {
+			blkptr_t *obp = dmu_buf_get_blkptr(db);
+			if (obp) {
+				ASSERT(BP_IS_HOLE(bp));
+				*bp = *obp;
+			}
+
 			zgd->zgd_db = db;
 			zgd->zgd_bp = bp;
 
@@ -3247,6 +3253,12 @@
 		uint64_t acl_obj;
 		new_mode = (pmode & S_IFMT) | (vap->va_mode & ~S_IFMT);
 
+		if (zp->z_zfsvfs->z_acl_mode == ZFS_ACL_RESTRICTED &&
+		    !(zp->z_pflags & ZFS_ACL_TRIVIAL)) {
+			err = EPERM;
+			goto out;
+		}
+
 		if (err = zfs_acl_chmod_setattr(zp, &aclp, new_mode))
 			goto out;
 

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -104,6 +104,31 @@
 extern int zfs_mg_alloc_failures;
 
 /*
+ * The following actions directly effect the spa's sync-to-convergence logic.
+ * The values below define the sync pass when we start performing the action.
+ * Care should be taken when changing these values as they directly impact
+ * spa_sync() performance. Tuning these values may introduce subtle performance
+ * pathologies and should only be done in the context of performance analysis.
+ * These tunables will eventually be removed and replaced with #defines once
+ * enough analysis has been done to determine optimal values.
+ *
+ * The 'zfs_sync_pass_deferred_free' pass must be greater than 1 to ensure that
+ * regular blocks are not deferred.
+ */
+int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */
+TUNABLE_INT("vfs.zfs.sync_pass_deferred_free", &zfs_sync_pass_deferred_free);
+SYSCTL_INT(_vfs_zfs, OID_AUTO, sync_pass_deferred_free, CTLFLAG_RDTUN,
+    &zfs_sync_pass_deferred_free, 0, "defer frees starting in this pass");
+int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
+TUNABLE_INT("vfs.zfs.sync_pass_dont_compress", &zfs_sync_pass_dont_compress);
+SYSCTL_INT(_vfs_zfs, OID_AUTO, sync_pass_dont_compress, CTLFLAG_RDTUN,
+    &zfs_sync_pass_dont_compress, 0, "don't compress starting in this pass");
+int zfs_sync_pass_rewrite = 2; /* rewrite new bps starting in this pass */
+TUNABLE_INT("vfs.zfs.sync_pass_rewrite", &zfs_sync_pass_rewrite);
+SYSCTL_INT(_vfs_zfs, OID_AUTO, sync_pass_rewrite, CTLFLAG_RDTUN,
+    &zfs_sync_pass_rewrite, 0, "rewrite new bps starting in this pass");
+
+/*
  * An allocating zio is one that either currently has the DVA allocate
  * stage set or will have it later in its lifetime.
  */
@@ -668,9 +693,7 @@
 	    zp->zp_type < DMU_OT_NUMTYPES &&
 	    zp->zp_level < 32 &&
 	    zp->zp_copies > 0 &&
-	    zp->zp_copies <= spa_max_replication(spa) &&
-	    zp->zp_dedup <= 1 &&
-	    zp->zp_dedup_verify <= 1);
+	    zp->zp_copies <= spa_max_replication(spa));
 
 	zio = zio_create(pio, spa, txg, bp, data, size, done, private,
 	    ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
@@ -698,7 +721,7 @@
 }
 
 void
-zio_write_override(zio_t *zio, blkptr_t *bp, int copies)
+zio_write_override(zio_t *zio, blkptr_t *bp, int copies, boolean_t nopwrite)
 {
 	ASSERT(zio->io_type == ZIO_TYPE_WRITE);
 	ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
@@ -705,6 +728,13 @@
 	ASSERT(zio->io_stage == ZIO_STAGE_OPEN);
 	ASSERT(zio->io_txg == spa_syncing_txg(zio->io_spa));
 
+	/*
+	 * We must reset the io_prop to match the values that existed
+	 * when the bp was first written by dmu_sync() keeping in mind
+	 * that nopwrite and dedup are mutually exclusive.
+	 */
+	zio->io_prop.zp_dedup = nopwrite ? B_FALSE : zio->io_prop.zp_dedup;
+	zio->io_prop.zp_nopwrite = nopwrite;
 	zio->io_prop.zp_copies = copies;
 	zio->io_bp_override = bp;
 }
@@ -726,7 +756,7 @@
 
 	ASSERT(!BP_IS_HOLE(bp));
 	ASSERT(spa_syncing_txg(spa) == txg);
-	ASSERT(spa_sync_pass(spa) <= SYNC_PASS_DEFERRED_FREE);
+	ASSERT(spa_sync_pass(spa) < zfs_sync_pass_deferred_free);
 
 	zio = zio_create(pio, spa, txg, bp, NULL, size,
 	    NULL, NULL, ZIO_TYPE_FREE, ZIO_PRIORITY_FREE, flags,
@@ -1004,6 +1034,19 @@
 		*bp = *zio->io_bp_override;
 		zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
 
+		/*
+		 * If we've been overridden and nopwrite is set then
+		 * set the flag accordingly to indicate that a nopwrite
+		 * has already occurred.
+		 */
+		if (!BP_IS_HOLE(bp) && zp->zp_nopwrite) {
+			ASSERT(!zp->zp_dedup);
+			zio->io_flags |= ZIO_FLAG_NOPWRITE;
+			return (ZIO_PIPELINE_CONTINUE);
+		}
+
+		ASSERT(!zp->zp_nopwrite);
+
 		if (BP_IS_HOLE(bp) || !zp->zp_dedup)
 			return (ZIO_PIPELINE_CONTINUE);
 
@@ -1035,7 +1078,7 @@
 		ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
 		ASSERT(!BP_GET_DEDUP(bp));
 
-		if (pass > SYNC_PASS_DONT_COMPRESS)
+		if (pass >= zfs_sync_pass_dont_compress)
 			compress = ZIO_COMPRESS_OFF;
 
 		/* Make sure someone doesn't change their mind on overwrites */
@@ -1064,7 +1107,7 @@
 	 * There should only be a handful of blocks after pass 1 in any case.
 	 */
 	if (bp->blk_birth == zio->io_txg && BP_GET_PSIZE(bp) == psize &&
-	    pass > SYNC_PASS_REWRITE) {
+	    pass >= zfs_sync_pass_rewrite) {
 		ASSERT(psize != 0);
 		enum zio_stage gang_stages = zio->io_pipeline & ZIO_GANG_STAGES;
 		zio->io_pipeline = ZIO_REWRITE_PIPELINE | gang_stages;
@@ -1091,6 +1134,11 @@
 			ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
 			zio->io_pipeline = ZIO_DDT_WRITE_PIPELINE;
 		}
+		if (zp->zp_nopwrite) {
+			ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
+			ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
+			zio->io_pipeline |= ZIO_STAGE_NOP_WRITE;
+		}
 	}
 
 	return (ZIO_PIPELINE_CONTINUE);
@@ -1312,6 +1360,7 @@
 	pio->io_stage = pio->io_orig_stage;
 	pio->io_pipeline = pio->io_orig_pipeline;
 	pio->io_reexecute = 0;
+	pio->io_flags |= ZIO_FLAG_REEXECUTED;
 	pio->io_error = 0;
 	for (int w = 0; w < ZIO_WAIT_TYPES; w++)
 		pio->io_state[w] = 0;
@@ -1788,8 +1837,9 @@
 		zp.zp_type = DMU_OT_NONE;
 		zp.zp_level = 0;
 		zp.zp_copies = gio->io_prop.zp_copies;
-		zp.zp_dedup = 0;
-		zp.zp_dedup_verify = 0;
+		zp.zp_dedup = B_FALSE;
+		zp.zp_dedup_verify = B_FALSE;
+		zp.zp_nopwrite = B_FALSE;
 
 		zio_nowait(zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
 		    (char *)pio->io_data + (pio->io_size - resid), lsize, &zp,
@@ -1809,6 +1859,62 @@
 }
 
 /*
+ * The zio_nop_write stage in the pipeline determines if allocating
+ * a new bp is necessary.  By leveraging a cryptographically secure checksum,
+ * such as SHA256, we can compare the checksums of the new data and the old
+ * to determine if allocating a new block is required.  The nopwrite
+ * feature can handle writes in either syncing or open context (i.e. zil
+ * writes) and as a result is mutually exclusive with dedup.
+ */
+static int
+zio_nop_write(zio_t *zio)
+{
+	blkptr_t *bp = zio->io_bp;
+	blkptr_t *bp_orig = &zio->io_bp_orig;
+	zio_prop_t *zp = &zio->io_prop;
+
+	ASSERT(BP_GET_LEVEL(bp) == 0);
+	ASSERT(!(zio->io_flags & ZIO_FLAG_IO_REWRITE));
+	ASSERT(zp->zp_nopwrite);
+	ASSERT(!zp->zp_dedup);
+	ASSERT(zio->io_bp_override == NULL);
+	ASSERT(IO_IS_ALLOCATING(zio));
+
+	/*
+	 * Check to see if the original bp and the new bp have matching
+	 * characteristics (i.e. same checksum, compression algorithms, etc).
+	 * If they don't then just continue with the pipeline which will
+	 * allocate a new bp.
+	 */
+	if (BP_IS_HOLE(bp_orig) ||
+	    !zio_checksum_table[BP_GET_CHECKSUM(bp)].ci_dedup ||
+	    BP_GET_CHECKSUM(bp) != BP_GET_CHECKSUM(bp_orig) ||
+	    BP_GET_COMPRESS(bp) != BP_GET_COMPRESS(bp_orig) ||
+	    BP_GET_DEDUP(bp) != BP_GET_DEDUP(bp_orig) ||
+	    zp->zp_copies != BP_GET_NDVAS(bp_orig))
+		return (ZIO_PIPELINE_CONTINUE);
+
+	/*
+	 * If the checksums match then reset the pipeline so that we
+	 * avoid allocating a new bp and issuing any I/O.
+	 */
+	if (ZIO_CHECKSUM_EQUAL(bp->blk_cksum, bp_orig->blk_cksum)) {
+		ASSERT(zio_checksum_table[zp->zp_checksum].ci_dedup);
+		ASSERT3U(BP_GET_PSIZE(bp), ==, BP_GET_PSIZE(bp_orig));
+		ASSERT3U(BP_GET_LSIZE(bp), ==, BP_GET_LSIZE(bp_orig));
+		ASSERT(zp->zp_compress != ZIO_COMPRESS_OFF);
+		ASSERT(bcmp(&bp->blk_prop, &bp_orig->blk_prop,
+		    sizeof (uint64_t)) == 0);
+
+		*bp = *bp_orig;
+		zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
+		zio->io_flags |= ZIO_FLAG_NOPWRITE;
+	}
+
+	return (ZIO_PIPELINE_CONTINUE);
+}
+
+/*
  * ==========================================================================
  * Dedup
  * ==========================================================================
@@ -2080,7 +2186,7 @@
 			zio->io_stage = ZIO_STAGE_OPEN;
 			BP_ZERO(bp);
 		} else {
-			zp->zp_dedup = 0;
+			zp->zp_dedup = B_FALSE;
 		}
 		zio->io_pipeline = ZIO_WRITE_PIPELINE;
 		ddt_exit(ddt);
@@ -2737,7 +2843,8 @@
 
 	if (zio->io_ready) {
 		ASSERT(IO_IS_ALLOCATING(zio));
-		ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));
+		ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp) ||
+		    (zio->io_flags & ZIO_FLAG_NOPWRITE));
 		ASSERT(zio->io_children[ZIO_CHILD_GANG][ZIO_WAIT_READY] == 0);
 
 		zio->io_ready(zio);
@@ -2819,6 +2926,8 @@
 			ASSERT(BP_COUNT_GANG(bp) == 0 ||
 			    (BP_COUNT_GANG(bp) == BP_GET_NDVAS(bp)));
 		}
+		if (zio->io_flags & ZIO_FLAG_NOPWRITE)
+			VERIFY(BP_EQUAL(bp, &zio->io_bp_orig));
 	}
 
 	/*
@@ -2928,7 +3037,7 @@
 
 	if ((zio->io_error || zio->io_reexecute) &&
 	    IO_IS_ALLOCATING(zio) && zio->io_gang_leader == zio &&
-	    !(zio->io_flags & ZIO_FLAG_IO_REWRITE))
+	    !(zio->io_flags & (ZIO_FLAG_IO_REWRITE | ZIO_FLAG_NOPWRITE)))
 		zio_dva_unallocate(zio, zio->io_gang_tree, bp);
 
 	zio_gang_tree_free(&zio->io_gang_tree);
@@ -3072,6 +3181,7 @@
 	zio_issue_async,
 	zio_write_bp_init,
 	zio_checksum_generate,
+	zio_nop_write,
 	zio_ddt_read_start,
 	zio_ddt_read_done,
 	zio_ddt_write,

Modified: trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
===================================================================
--- trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	2016-09-06 23:30:10 UTC (rev 7819)
+++ trunk/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	2016-09-06 23:50:03 UTC (rev 7820)
@@ -78,6 +78,7 @@
 #include <sys/vdev_impl.h>
 #include <sys/zvol.h>
 #include <sys/zil_impl.h>
+#include <sys/dbuf.h>
 #include <geom/geom.h>
 
 #include "zfs_namecheck.h"
@@ -1017,6 +1018,12 @@
 		error = dmu_buf_hold(os, object, offset, zgd, &db,
 		    DMU_READ_NO_PREFETCH);
 		if (error == 0) {
+			blkptr_t *obp = dmu_buf_get_blkptr(db);
+			if (obp) {
+				ASSERT(BP_IS_HOLE(bp));
+				*bp = *obp;
+			}
+
 			zgd->zgd_db = db;
 			zgd->zgd_bp = bp;
 



More information about the Midnightbsd-cvs mailing list