Deduplication of rsnapshot backup data possible?

  • Hello everybody!


    Is there any way to consolidate backup data from incremental rsnapshot backups?



    My problem: disk space on the backup drive is getting low, since I had, and still will have, some bigger data moves inside my data structure (e.g. the import of a year's photos, or the correction of a typo in a parent folder's name). All of these cause a copy instead of a hardlink, even though the data itself hasn't actually changed.


    Is there a way to deduplicate the data? I'm sure there is quite a big potential to save disk space.


    Google led me to a program called "hardlink" ( https://manpages.debian.org/st…rdlink/hardlink.1.en.html or https://www.dinotools.de/2013/…atz-sparen-mit-hardlinks/ )
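
    For reference, a minimal invocation could look like this (the backup path is an assumption, and flags differ slightly between versions of the tool, so check the man page first):

        hardlink -n -v /srv/backup   # dry run: only report what would be linked
        hardlink /srv/backup         # actually replace duplicates with hardlinks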


    Does anyone have experience with this?


    I'm a Linux beginner, anew every time I have to maintain my NAS. That's why I wasn't able to test it myself (probably because of the old OMV 2.2 version... not sure about that).



    Any advice is highly welcome!

    • Official post

    I use rsync directly, not rsnapshot, for my incremental snapshot-style backups. I suspect the backup storage is very similar.


    When I restructure the data storage I typically also purge all but the latest snapshot and restructure that latest snapshot to mimic the new data storage. That way I avoid storing the same file twice in the backup storage. To be doubly safe you could also delete the old snapshot once the new one has finished.


    A clever backup utility could perhaps automatically hardlink to the same file in the backup storage even if it is in the wrong place, using a combination of file metadata and file checksum. It could also handle duplicate files in the backup storage the same way.

  • Well, doing so I would not be able to roll back, would I? I really like being able to roll back, since I don't know now which file or which state I'm going to miss.


    If this 'hardlink' program works, it would do exactly that (checking checksums and hardlinking everything that exists more than once). Perfect together with rsnapshot (which is basically rsync, as far as I understood): have it run once a month, for example, purging every duplicate that has accumulated in the meantime.
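
    If it works, such a monthly run could be scheduled via cron, for example in /etc/cron.d (a sketch; the backup path and the schedule are assumptions):

        # min hour day month weekday user command
        0 3 1 * * root hardlink /srv/backup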

  • Sorry, I don't get the point... It doesn't have anything to do with where the "hardlink" command is installed, does it?


    The files shall be hardlinked on (or within) the filesystem (partition) where they already are. There are only 116 GB free (out of 8 TB), which is not much when the next major change happens.

    • Official post

    If the files are part of rsync snapshots made by rsnapshot, then they already are hardlinked. You want the next snapshot of the newly structured content to be hardlinked to the old snapshots?


    That may be difficult if you are low on space. I assume that the old snapshot and the new snapshot must be on the same filesystem in order to successfully combine them using the hardlink utility.


    But I have no direct experience with either rsnapshot or the hardlink utility. I may be wrong...


    If you already have the new snapshot on the filesystem, then you should be able to use the hardlink utility without problems.

  • OK, now I think I got it.


    There already is quite a number of snapshots on the filesystem (the latest from last night). They are properly hardlinked, as long as rsnapshot found the data in the same place and in the same condition. But since there were some big moves, renamed folders etc. in the source data structure in the past, there is quite a lot of data that rsnapshot did not recognize as hardlinkable but that is actually identical.


    I'm looking for a way to hardlink this backup data after the fact, in order to save space while still being able to change parts of the source folder structure without instantly running low on backup disk space.


    So IF the 'hardlink' command works as described, it should do the job, I assume. Maybe there is somebody who has already worked with it?

    • Official post

    rsnapshot is just a frontend for rsync, and one main point of rsnapshot is to hardlink files that haven't changed. If the files aren't hardlinked, they must have changed (or they are on different filesystems). There is no utility that is going to fix this.
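
    For context, the hardlinking rsnapshot relies on boils down to rsync's --link-dest: files identical to the previous snapshot become hardlinks, everything else is copied. A rough sketch with hypothetical paths:

        rsync -a --delete \
              --link-dest=/srv/backup/daily.1 \
              /srv/data/ /srv/backup/daily.0/
        # unchanged files (same relative path, size, mtime) are hardlinked
        # to daily.1; moved or renamed files look new and get copied in full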


  • Well, I understood that rsnapshot does not hardlink if a file has just been moved or its containing folder renamed. Does that already count as a "different filesystem"?


    Or will rsnapshot hardlink files which are identical but have been moved? (It actually does not hardlink them, as I found out in my own experiment.)


    But the program 'hardlink' was described as doing exactly that?

    • Official post

    For rsync (and rsnapshot?) to hardlink, the files have to be unchanged AND not moved since the last snapshot.


    But, as I said, you can cheat and move the files in the old snapshot to match the new structure before you take a new snapshot. Then rsync can hardlink fine.
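
    In other words, something like this before the next snapshot run (a sketch; the paths are hypothetical):

        # mirror the reorganisation inside the newest snapshot, e.g.:
        mv /srv/backup/daily.0/photos/2018_typo /srv/backup/daily.0/photos/2018
        # then take the next snapshot as usual; rsync now finds the files
        # where it expects them and hardlinks instead of copying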


    And from what I saw in the man page of the hardlink utility, it can "fix" hardlinks between snapshots when files have been moved but are otherwise unchanged.


    Rather interesting. It would be nice to have a backup utility that can do snapshots and hardlinks just like rsync, but that can also search for candidates to hardlink to.


    For instance it might figure out that the files:


    /work/ongoing/project1/big.zip
    /work/finished/2019/project1/big.zip.bak


    actually are the same file, just with different names/locations.
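
    Done by hand for that pair, the check could look like this (a sketch; ln -f replaces the second name with a hardlink to the first):

        sha256sum /work/ongoing/project1/big.zip /work/finished/2019/project1/big.zip.bak
        # if both checksums match:
        ln -f /work/ongoing/project1/big.zip /work/finished/2019/project1/big.zip.bak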

  • And from what I saw in the man page of the hardlink utility, it can "fix" hardlinks between snapshots when files have been moved but are otherwise unchanged.

    That's exactly what I'm looking for!! It would be way too much work to fix the folder structure manually in a daily, weekly and monthly snapshot backup structure, and maybe it's actually not even possible...


    At the moment I'm updating to Debian 9.8 / OMV 4 (it was time to do that anyway), hoping that installing and using the 'hardlink' program will work after that, so I can try it...

    • Official post

    Hardlinks are not what you want if you are moving stuff all over the place and still want it not to take up extra space. You should be using a CoW filesystem and snapshots. I have about 20 snapshots (using rsnapshot) of almost 6 TB worth of data on an 8 TB drive. I don't move things around all the time, though.
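
    For the record, the CoW route could look like this on btrfs (a sketch; the paths are hypothetical and /srv/data would have to be a subvolume). A read-only snapshot shares all blocks with the live data, so later moves and renames cost next to no space:

        btrfs subvolume snapshot -r /srv/data /srv/snapshots/2019-03-01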


    • Official post

    Well, if you have reorganized the original folder structure on the data storage, there is, hypothetically, no reason why a backup utility otherwise similar to rsync couldn't be smart enough to search the previous snapshot in the backup storage for hardlink candidates during the backup. Then renamed/moved folders/files wouldn't matter. They would still be hardlinked, if the files match.


    And a CoW or checksummed filesystem wouldn't even be strictly necessary. Checksums could be calculated separately.


    You could perhaps specify different "hardlink levels" to use when searching for hardlink candidates while updating snapshots, from none through standard to aggressive: 0 = no hardlinking, 1 = hardlink if path+name+size+time match, 2 = hardlink if checksum+name+size+time match, 3 = hardlink if checksum+size match.
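
    A rough shell approximation of level 3 (checksum-based; matching checksums imply matching size in practice), listing hardlink candidates under a hypothetical backup path:

        find /srv/backup -type f -print0 \
            | xargs -0 sha256sum \
            | sort \
            | uniq -w64 --all-repeated=separate   # group files sharing a checksum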


    That's exactly what I'm looking for!! It would be way too much work to fix the folder structure manually in a daily, weekly and monthly snapshot backup structure, and maybe it's actually not even possible...

    Well, you should only need to update the latest snapshot. That is the only one that is used for hardlinks. I assume...


    But I typically purge old snapshots, except the latest, as well, if I change something. If the folder structure has changed, I wouldn't want to restore something that uses the old structure. But then, I back up several folder trees separately, depending on their contents: for instance, only movies in one snapshot backup structure, only ebooks in another, and so on. And I launch the rsync snapshot script sequentially for each folder tree that is backed up.

    • Official post

    couldn't be smart enough to search the previous snapshot in the backup storage for hardlink candidates during the backup. Then renamed/moved folders/files wouldn't matter. They would still be hardlinked, if the files match.


    And a CoW or checksummed filesystem wouldn't even be strictly necessary. Checksums could be calculated separately.

    I suggest this because it is easy to still get at the backed-up file structure, like with rsync/rsnapshot. If you had to checksum all the files and you had a lot of data, it might be a struggle to execute a backup hourly. If the filesystem way is not the way, then borgbackup is very good, since it dedupes at the block level and compresses. It also allows you to mount backups.
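
    A minimal borgbackup workflow for comparison (a sketch; the repo path and archive name are assumptions):

        borg init --encryption=repokey /srv/borg-repo        # one-time repository setup
        borg create --compression lz4 /srv/borg-repo::2019-03-01 /srv/data
        borg mount /srv/borg-repo::2019-03-01 /mnt/restore   # browse a backup like a filesystem (needs FUSE)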


    • Official post

    One problem with hardlink might be running it on a very large set of files on a computer with little RAM. But other than that it seems very nice.


    For instance, it could be run as part of an rsync snapshot script. Then it could fix hardlinks for files that have been moved or renamed since the previous snapshot but are otherwise unchanged. I will update my rsync snapshot scripts!
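
    Such a script could look roughly like this (a sketch with hypothetical paths, only two snapshot levels and no error handling; note the follow-up post below about a timestamp pitfall in this combination):

        #!/bin/sh
        cd /srv/backup || exit 1
        rm -rf daily.1                    # drop the oldest snapshot
        [ -d daily.0 ] && mv daily.0 daily.1
        rsync -a --delete --link-dest=/srv/backup/daily.1 /srv/data/ daily.0/
        hardlink daily.0 daily.1          # re-link files that were only moved or renamed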


    I have had problems with new snapshots taking up a lot of space after restructuring. Files in new/download folders would also be handled efficiently: no new copies when they are just renamed or moved to the right subfolders.


    Edit: "hadori", HArdlinking DOne RIght seems to be an alternative that is less memory intensive. It does a full compare if the sizes match. Written in C++ instead of C and smaller and simpler code. I like that. It does not seem to care about xattr/ACLs, from the code, but that is fine for me.


    But hadori is not available in the Stretch repos (a spelling error in the description?). It is in the Ubuntu, Buster and Sid repos, though.


    https://packages.debian.org/search?keywords=hadori


    I will use hadori instead of hardlink.

    • Official post

    Nope! That didn't work out.


    I tried making rsync snapshots from a script, followed by "deduplication" of the backup/snapshot with hadori.


    Individually, rsync and hadori work fine. But in combination they are bad.


    Rsync snapshots are hardlinked to each other, so only new/updated files need to be copied. But hadori updates the timestamps of files when it "deduplicates". That means rsync will copy those hardlinked files again next time, and then hadori has to hardlink them and update their timestamps yet again. So every run is slow, instead of getting faster the second time...


    The timestamp of a file belongs to the inode, not the directory entry. So every hardlink to a file shares the same timestamp. I didn't take that into account.
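
    That is easy to demonstrate (a sketch): hardlinked names share one inode, so changing the mtime through one name changes it for all of them.

        touch a
        ln a b
        touch -d '2019-01-01' a
        stat -c '%n %i %y' a b   # both names report the same inode and mtime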


    The effect is that over time the backup snapshots take up much more space than before, instead of less.


    Ideally an improved rsync is needed that combines rsync and hadori and remembers file hashes between runs, to avoid having to compare many files.
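
    Such a hash cache could be as simple as this (a sketch; the paths are assumptions): re-hash only files changed since the previous run, tracked by a stamp file.

        CACHE=/srv/backup/.sha256-cache
        STAMP=/srv/backup/.sha256-stamp
        if [ -f "$STAMP" ]; then
            # incremental run: hash only files modified since the stamp
            find /srv/backup/daily.0 -type f -newer "$STAMP" -exec sha256sum {} + >> "$CACHE"
        else
            # first run: hash everything
            find /srv/backup/daily.0 -type f -exec sha256sum {} + > "$CACHE"
        fi
        touch "$STAMP"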
