Deduplication of rsnapshot backup data possible?

    • OMV 2.x


    • Deduplication of rsnapshot backup data possible?

      Hello everybody!

      Is there any way to consolidate backup data from incremental rsnapshot backups?


      My problem: Disk space on the backup drive is getting low, since I had, and still will have, some bigger data moves inside my data structure (e.g. importing the photos of a whole year, or correcting a typo in a parent folder's name). All of these cause a copy instead of a hardlink, even though the data itself doesn't actually change.

      Is there a possibility to deduplicate the data afterwards? I'm sure there is quite a big potential to save disk space.

      Google led me to a program called "hardlink" ( manpages.debian.org/stretch/hardlink/hardlink.1.en.html or dinotools.de/2013/07/10/linux-platz-sparen-mit-hardlinks/ ).

      Has anyone had experience with this?

      I'm a Linux beginner, all over again every time I have to maintain my NAS. That's why I wasn't able to test it myself (probably because of my old OMV 2.2 version... not sure about that).


      Any advice is highly welcome!
    • I use rsync directly, not rsnapshot, for my incremental snapshot-style backups. I suspect the backup storage is very similar.

      When I restructure the data storage I typically also purge all but the latest snapshot and restructure that latest snapshot to mimic the new layout. That way I avoid that the same file is stored twice in the backup storage. To be doubly safe you could also delete the old snapshot once the new one has finished.

      A clever backup utility should perhaps automatically hardlink to the same file in the backup storage even if it is in the wrong place, using a combination of file metadata and file checksums. It could handle duplicate files in the backup storage the same way.
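      The snapshot scheme I use can be sketched roughly like this (the paths and the "latest" symlink convention are made up for the example):

      ```shell
      #!/bin/sh
      # Minimal sketch of an rsync snapshot backup. Paths and the
      # "latest" symlink convention are hypothetical. Unchanged files
      # are hardlinked against the previous snapshot via --link-dest,
      # so they take no extra space.
      set -e

      SRC="/srv/data/"
      DEST="/srv/backup"
      NEW="$DEST/$(date +%Y-%m-%d_%H%M%S)"

      if [ -e "$DEST/latest" ]; then
          rsync -a --link-dest="$DEST/latest" "$SRC" "$NEW"
      else
          rsync -a "$SRC" "$NEW"    # very first snapshot: plain copy
      fi

      ln -sfn "$NEW" "$DEST/latest" # repoint "latest" at the new snapshot
      ```

      Deleting an old snapshot directory only removes directory entries; a file's blocks are freed once its last hardlink is gone.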
      OMV 4, 7 x ODROID HC2, 1 x ODROID HC1, 5 x 12TB, 1 x 8TB, 1 x 2TB SSHD, 1 x 500GB SSD, GbE, WiFi mesh
    • Well, doing so I would not be able to roll back, would I? I really like being able to roll back, since I don't know yet which file or which state I'm going to miss.

      If this 'hardlink' program works, it would be exactly that: checking checksums and hardlinking everything that exists more than once. Perfect together with rsnapshot (which is basically rsync, as far as I understood), running it once a month, for example, to purge everything that got duplicated in the meantime.
    • Sorry, I don't get the point... It doesn't have anything to do with where the "hardlink" command is installed, does it?

      The files shall be hardlinked on (or within) the filesystem (partition) where they already are. There are only 116 GB free (out of 8 TB), which is not much when some major change happens.
    • If the files are part of rsync snapshots made by rsnapshot, then they are already hardlinked. You want the next snapshot of the newly structured content to be hardlinked to the old snapshots?

      That may be difficult if you are low on space. I assume that you must have an old snapshot and a new snapshot on the same filesystem in order to successfully combine them using the hardlink utility.

      But I have no direct experience with either rsnapshot or the hardlink utility. I may be wrong...

      If you already have the new snapshot on the filesystem, then you should be able to use the hardlink utility without problems.
      OMV 4, 7 x ODROID HC2, 1 x ODROID HC1, 5 x 12TB, 1 x 8TB, 1 x 2TB SSHD, 1 x 500GB SSD, GbE, WiFi mesh


    • OK, now I think I got it.

      There is already quite a number of snapshots (the latest from last night) on the filesystem. They are properly hardlinked, as long as rsnapshot found the data in the same place and in the same condition. But since there were some big moves, renamings of folders etc. in the source data structure in the past, there is quite a lot of data that rsnapshot did not recognize as hardlinkable but that is actually identical.

      I'm looking for a possibility to hardlink this backup data afterwards, in order to save space while still being able to change parts of the source folder structure without instantly running low on backup disk space.

      So IF the 'hardlink' command works as described, it should do the job, I assume. Maybe there is somebody who has already worked with it?
    • rsnapshot is just a frontend for rsync and one main point of rsnapshot is to hardlink files that haven't changed. If the files aren't hardlinked, they must have changed (or they are on different filesystems). There is no utility that is going to fix this.
      omv 4.1.22 arrakis | 64 bit | 4.15 proxmox kernel | omvextrasorg 4.1.15
      omv-extras.org plugins source code and issue tracker - github

      Please read this before posting a question and this and this for docker questions.
      Please don't PM for support... Too many PMs!
    • Well, I understood that rsnapshot does not hardlink if the file has just been moved or its containing folder renamed. Does that already count as a "different filesystem"?

      Or will rsnapshot hardlink files that are identical but have been moved? (It actually does not hardlink them, as I found out in my own experiment.)

      But the program 'hardlink' was described as doing exactly that?
    • For rsync (and rsnapshot?) to hardlink, the files have to be unchanged AND not moved since the last snapshot.

      But, as I said, you can cheat and move the files in the old snapshot, to match the new structure, before you take a new snapshot. Then rsync can hardlink fine.

      And from what I saw of the man page of the hardlink utility it can "fix" hard links between snapshots when the files have been moved but otherwise are unchanged.

      Rather interesting. Would be nice to have a backup utility that can do snapshots and hard links just like rsync, but also can search for candidates to hard link to.

      For instance it might figure out that the files:

      /work/ongoing/project1/big.zip
      /work/finished/2019/project1/big.zip.bak

      actually are the same file, just with different names/locations.
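      What such a content-based search would do is easy to sketch with plain coreutils (a toy illustration only; the path is hypothetical, and a real tool like the hardlink utility does this much more carefully):

      ```shell
      #!/bin/sh
      # Toy content-based deduplicator: files with identical md5
      # checksums are replaced by hardlinks to the first copy seen.
      # Illustration only; no metadata checks, same filesystem assumed.
      set -e
      BACKUP="${1:-/srv/backup/snapshots}"   # hypothetical default path

      find "$BACKUP" -type f -exec md5sum {} + | sort |
      while read -r sum path; do
          if [ "$sum" = "$prevsum" ]; then
              ln -f "$prevpath" "$path"      # same content: link, don't store twice
          else
              prevsum=$sum
              prevpath=$path
          fi
      done
      ```

      Sorting by checksum puts identical files next to each other, so one pass over the sorted list is enough.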
      OMV 4, 7 x ODROID HC2, 1 x ODROID HC1, 5 x 12TB, 1 x 8TB, 1 x 2TB SSHD, 1 x 500GB SSD, GbE, WiFi mesh
    • Adoby wrote:

      And from what I saw of the man page of the hardlink utility it can "fix" hard links between snapshots when the files have been moved but otherwise are unchanged.
      Exactly that's what I'm looking for!! It would be way too much work to fix a folder structure manually in a daily, weekly and monthly snapshot backup structure, or maybe it's actually not possible at all...

      At the moment I'm updating to Debian 9.8 / OMV 4 (it was time to do so anyway), hoping that installing and using the 'hardlink' program will work after that, so I can try it...
    • Hardlinks are not what you want if you are moving stuff all over the place and still want it not to take up extra space. You should be using a CoW filesystem and snapshots. I have about 20 snapshots (using rsnapshot) of almost 6 TB worth of data on an 8 TB drive. I don't move things around all the time though.
      omv 4.1.22 arrakis | 64 bit | 4.15 proxmox kernel | omvextrasorg 4.1.15
    • Well, if you have reorganized the original folder structure on the data storage, there is, hypothetically, no reason why a backup utility, otherwise similar to rsync, couldn't be smart enough to search the previous snapshot in the backup storage for hardlink candidates during the backup. Then renamed/moved folders/files wouldn't matter. They would still be hardlinked, if the files match.

      And it wouldn't even be strictly necessary with a CoW or checksummed filesystem. Checksums could be calculated separately.

      You could perhaps specify different "hardlink levels" to use when searching for hardlink candidates while updating snapshots, from none through standard to aggressive: 0 = no hardlinking, 1 = hardlink if path+name+size+time match, 2 = hardlink if checksum+name+size+time match, 3 = hardlink if checksum+size match.

      Bausau wrote:

      Exactly that's what I'm looking for!! It would be way too much work to fix a folder structure manually in a daily, weekly and monthly snapshot backup structure, or maybe it's actually not possible at all...
      Well, you should only need to update the latest snapshot. That is the only one that is used for hard links. I assume...

      But I typically purge old snapshots, except the latest, as well, if I change something. If the folder structure has changed, I wouldn't want to restore something that uses the old structure. But then I back up several folder trees separately depending on their contents. For instance only movies in one snapshot backup structure, and only ebooks in another, and so on... And I launch the rsync snapshot script sequentially for each folder tree that is backed up.
      OMV 4, 7 x ODROID HC2, 1 x ODROID HC1, 5 x 12TB, 1 x 8TB, 1 x 2TB SSHD, 1 x 500GB SSD, GbE, WiFi mesh


    • Adoby wrote:

      couldn't be smart enough during the backup to search for hard link candidates within the previous snapshot in the backup storage. Then renamed/moved folders/files wouldn't matter. They would still be hard linked, if the files match.

      And it wouldn't even be strictly necessary with a CoW or checksummed filesystem. Checksums could be calculated separately.
      I suggest this because it is easy to still get at the backed-up file structure, like with rsync/rsnapshot. If you had to checksum all the files and you had a lot of data, it might be a struggle to execute a backup hourly. If the filesystem way is not the way, then borgbackup is very good, since it dedupes at the block level and compresses. It also allows you to mount backups.
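      A typical borgbackup session looks roughly like this (repository path, archive names and the prune policy are made up for the example):

      ```shell
      # Create a repository once (hypothetical path).
      borg init --encryption=repokey /srv/backup/borg

      # Each run stores only blocks not already in the repository,
      # deduplicated across all archives (compression is configurable
      # via --compression).
      borg create --stats /srv/backup/borg::data-{now} /srv/data

      # Thin out old archives, e.g. keep 7 daily and 4 weekly ones.
      borg prune --keep-daily 7 --keep-weekly 4 /srv/backup/borg

      # Browse an archive as a read-only filesystem (archive name hypothetical).
      borg mount /srv/backup/borg::data-2019-03-01 /mnt/restore
      ```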
      omv 4.1.22 arrakis | 64 bit | 4.15 proxmox kernel | omvextrasorg 4.1.15
    • After some testing, 'hardlink' ( manpages.debian.org/stretch/hardlink/hardlink.1.en.html ) did exactly what I needed.
      It checks all content of a path and hardlinks identical (same hash) files. There is an option to do so even if the filename or timestamp changed, which is great if, for example, you rename all photos from DSCF0123.raf to YYYYMMDD_DSCF0123.raf and rsnapshot has already made one or more copies...
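      For reference, this is roughly how it can be invoked (the path is made up; the flags are as described in the Stretch man page linked above):

      ```shell
      # Dry run first: only report what would be linked (path is hypothetical).
      hardlink --dry-run --verbose /srv/backup/snapshots

      # Then link files with identical content even if their timestamps differ.
      hardlink --ignore-time /srv/backup/snapshots
      ```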
    • One problem with hardlink might be if you are running it on a very large set of files and the computer has little RAM. But other than that it seems to be very nice.

      For instance it could be run as part of an rsync snapshot script. Then it could fix hardlinks for files that have been moved or renamed since the previous snapshot but are otherwise unchanged. I will update my rsync snapshot scripts!

      I have had problems with new snapshots taking up a lot of space after restructuring. Also, files in new/download folders will be handled efficiently: no new copies when they are just renamed or moved to the right subfolders.

      Edit: "hadori" (HArdlinking DOne RIght) seems to be a less memory-intensive alternative. It does a full compare if the sizes match. It is written in C++ instead of C, with smaller and simpler code. I like that. Judging from the code, it does not seem to care about xattrs/ACLs, but that is fine for me.

      But hadori is not available in the Stretch repos. A spelling error in the description? It is in the Ubuntu, Buster and Sid repos, though.

      https://packages.debian.org/search?keywords=hadori

      I will use hadori instead of hardlink.
      OMV 4, 7 x ODROID HC2, 1 x ODROID HC1, 5 x 12TB, 1 x 8TB, 1 x 2TB SSHD, 1 x 500GB SSD, GbE, WiFi mesh


    • Nope! Didn't work out.

      I tried using rsync snapshots from script, followed by "deduplication" of the backup/snapshot with hadori.

      Individually, rsync and hadori work fine. But in combination it is bad.

      Rsync snapshots are hardlinked to each other, so only new/updated files need to be copied. But hadori updates the timestamps of files when it "deduplicates". That means rsync will copy those hardlinked files again the next time. And then hadori will again have to hardlink them and update their timestamps. So it is slow every time, instead of faster the second time...

      The timestamp of a file is connected to the inode, not to the directory entry. So every hardlink has the same timestamp. I didn't take that into account.

      The effect is that over time the backup snapshots take up much more space than before, instead of less.

      Ideally an improved rsync is needed that combines rsync and hadori and remembers file hashes between runs, to avoid having to compare many files.
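      The inode-shared timestamp is easy to demonstrate with plain coreutils:

      ```shell
      #!/bin/sh
      # Show that the mtime lives on the inode and is therefore shared
      # by every hardlink; this is why re-linking with updated
      # timestamps makes rsync consider the file changed in every
      # snapshot that shares the inode.
      set -e
      tmp=$(mktemp -d)

      echo data > "$tmp/original"
      ln "$tmp/original" "$tmp/link"      # second hardlink, same inode

      touch -d '2000-01-01' "$tmp/link"   # "touch" only one of the names...

      stat -c '%n %y' "$tmp/original" "$tmp/link"
      # ...but both names now report the 2000-01-01 mtime.

      rm -rf "$tmp"
      ```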
      OMV 4, 7 x ODROID HC2, 1 x ODROID HC1, 5 x 12TB, 1 x 8TB, 1 x 2TB SSHD, 1 x 500GB SSD, GbE, WiFi mesh