Inline data deduplication removes redundant data before it is written to disk. It contrasts with post-process deduplication, also known as asynchronous deduplication, which analyzes and reduces data after it has been stored to disk.
Comparison to post-process deduplication
Inline deduplication is the most efficient and economical method of deduplication. It significantly reduces the raw disk capacity the system needs because the full, not-yet-deduplicated data set is never written to disk. Inline deduplication also shortens the time to disaster recovery (DR) readiness because the system doesn't have to absorb the entire data set and then deduplicate it before it can begin replicating to the remote site.
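The idea of never writing the full data set to disk can be illustrated with a minimal sketch. It assumes fixed-size chunks, SHA-256 fingerprints, and an in-memory dictionary standing in for disk; the class and function names are hypothetical, and real products typically use more sophisticated chunking and indexing.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class InlineDedupStore:
    """Toy store: the dict stands in for disk, keyed by chunk fingerprint."""

    def __init__(self):
        self.disk = {}          # fingerprint -> chunk bytes
        self.bytes_written = 0  # bytes that actually reached "disk"

    def write(self, data: bytes):
        """Fingerprint each chunk BEFORE writing; skip chunks already stored."""
        recipe = []
        for chunk in chunks(data):
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.disk:      # only new chunks are written
                self.disk[fp] = chunk
                self.bytes_written += len(chunk)
            recipe.append(fp)            # the recipe rebuilds the original
        return recipe

    def read(self, recipe):
        """Reassemble the original data from its list of fingerprints."""
        return b"".join(self.disk[fp] for fp in recipe)


store = InlineDedupStore()
backup = b"AAAABBBBAAAACCCCAAAA"   # 20 bytes, but only 3 distinct chunks
recipe = store.write(backup)
print(len(backup), store.bytes_written)  # prints: 20 12
```

In this sketch the duplicate chunks are detected in the write path, so only 12 of the 20 incoming bytes ever occupy storage. A post-process design would first store all 20 bytes and reclaim the 8 redundant ones later.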
Post-process deduplication technologies wait for data to land in full on disk before initiating the deduplication process. This lengthens the time until deduplication is complete and, by extension, the time until replication completes, because it is highly advantageous to replicate only the deduplicated (smaller) data.
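The effect on replication lag can be sketched with a deliberately simple timing model. It assumes that post-process systems run ingest, deduplication, and replication strictly one after another, while an inline system overlaps replication perfectly with ingest. Both functions and all sample figures are hypothetical and only illustrate the ordering of the steps.

```python
def hours_to_dr_ready_post_process(ingest_h, dedup_h, replicate_h):
    """Sequential model: land all data, then deduplicate, then replicate."""
    return ingest_h + dedup_h + replicate_h


def hours_to_dr_ready_inline(ingest_h, replicate_h):
    """Overlapped model: deduplicated data replicates while ingest continues,
    so the slower of the two activities sets the finish time."""
    return max(ingest_h, replicate_h)


# Hypothetical job: 6 h to ingest, 3 h to deduplicate, 2 h to replicate.
print(hours_to_dr_ready_post_process(6, 3, 2))  # prints: 11
print(hours_to_dr_ready_inline(6, 2))           # prints: 6
```

Real systems rarely overlap perfectly or serialize completely, so actual figures fall somewhere between these two bounds.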
In practice, post-process deduplication creates operational issues because there are two storage zones, each with its own policies and behaviors to manage. In some cases, because some vendors treat the zone that holds the redundant, not-yet-deduplicated data as the default and more important part of the design, the deduplication zone also performs much worse and is less resilient.
In addition, post-process deduplication requires a greater initial capacity overhead than inline solutions, because it needs additional capacity to temporarily store duplicate backup data.
How much disk capacity is needed may depend on the size of the backup data sets, how many backup jobs you run on a daily basis, and how long the deduplication technology "holds on" to capacity before releasing it. Post-process deduplication solutions that wait for the backup process to complete before beginning to deduplicate require larger disk caches than those that start the deduplication process during the backup process.
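The sizing factors above can be combined into a back-of-the-envelope estimate. The sketch below is an assumption-laden model, not a vendor formula: `staging_capacity_tb` and its parameters are hypothetical, every job is assumed to be the same size, and `overlap_fraction` stands for the share of each job that has already been deduplicated and released by the time the job finishes landing (0.0 means deduplication waits for the backup to complete).

```python
def staging_capacity_tb(job_tb, jobs_per_day, hold_days, overlap_fraction=0.0):
    """Rough staging capacity needed for post-process deduplication.

    job_tb           -- size of one backup job before deduplication, in TB
    jobs_per_day     -- number of backup jobs run per day
    hold_days        -- how long staged data is held before capacity is released
    overlap_fraction -- share of a job deduplicated while it is still landing
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0.0 and 1.0")
    resident_jobs = jobs_per_day * hold_days      # jobs staged at the same time
    return job_tb * resident_jobs * (1.0 - overlap_fraction)


# Hypothetical workload: two 10 TB jobs per day, capacity held for one day.
print(staging_capacity_tb(10, 2, 1))                        # prints: 20.0
print(staging_capacity_tb(10, 2, 1, overlap_fraction=0.5))  # prints: 10.0
```

Under these assumptions, a solution that deduplicates half of each job while the backup is still running needs half the staging capacity of one that waits for the backup to finish.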