Duplicate finder via CLI (rdfind) - proper way to avoid system resource limit?

kwon · 25 March 2021

Hi there,

I'm using rdfind [1] to convert duplicate files into hardlinks. Since I also use rsnapshot as a backup solution (usually one drive for data, with backup1 and backup2 set up as targets for two separate rsnapshot jobs), quite a lot of space can be saved this way.

Especially when I've built a new OMV system and users start using it, renaming and moving a lot of files.

However, OMV on a Pentium (socket 1155) seems to slow down quite horribly and sends resource limit warnings, while at the same time the CPU load doesn't seem to be a problem at all (rdfind itself uses at most 30%, mostly around 10-15%).

I thought it might actually be the hard drives causing the slow response, since rdfind of course needs to scan the whole filesystem. However, the problem also occurs when I run rdfind on a backup drive only, leaving the data drive alone, in which case the system shouldn't slow down anymore.

Questions:

- Any other experiences with solutions to the duplicates problem apart from rdfind? (On ext4, not ZFS dedup etc.)

- Any idea what the reason for the system slowing down might be? Or which way to investigate? I'm kind of out of ideas.

- More generally: if I wanted to limit the CPU load that a specific scheduled job may use, what is the proper way to achieve that?

Hardware:

- CPU Pentium G640

- RAM 4GB

- OS 320 GB HDD, 2.5"

- 3x 4TB HDD

- Services: daily rsync, rsnapshot, rdfind; SMB, OpenVPN

- Simultaneous OpenVPN clients: 10 max, usually more like 3-5

[1] https://rdfind.pauldreik.se/

Thanks so far

kwon

kwon · 29 March 2021

Right, so nobody has that issue?

How do you guys deal with duplicates then?

Damn, I'm unlucky with my posts, it seems.

If I offer to tell you who *really* killed JFK, will that improve my odds?

greg77 · 26 January 2022

Hi, sorry for answering your question almost a year after you asked; I just found it on the forum while searching for information about rdfind myself.

If your data HDD is separate from the system HDD, that shouldn't matter, and it's not the hard drives that load the CPU. The rdfind command does some calculations while searching and comparing files, and that needs some CPU power.

To reduce that, there is the "-sleep" flag, which is basically a delay between reading each file. That makes the whole process slower but less CPU-hungry.

Check the man page; it's explained there: https://rdfind.pauldreik.se/rdfind.1.html
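
For reference, here is a minimal sketch of how the -sleep flag could be combined with the standard `nice` and `ionice` tools when launching rdfind from a scheduled job. It is written in Python so the command line can be inspected before anything runs. The helper name `build_rdfind_command`, the path `/srv/backup1` and the default values are assumptions for illustration only; `-dryrun` is left on by default so nothing is changed until the output has been reviewed.

```python
import shlex

# Sleep values (in milliseconds) that the rdfind man page lists as supported,
# as far as I read it; check the man page of your installed version.
SUPPORTED_SLEEP_MS = {0, 1, 2, 3, 4, 5, 10, 25, 50, 100}


def build_rdfind_command(directories, sleep_ms=10, niceness=19, dry_run=True):
    """Return an argument list that runs rdfind at low CPU and I/O priority.

    Hypothetical helper, not part of rdfind itself.
    """
    if sleep_ms not in SUPPORTED_SLEEP_MS:
        raise ValueError(f"unsupported -sleep value: {sleep_ms}ms")
    cmd = [
        "nice", "-n", str(niceness),   # lowest CPU scheduling priority
        "ionice", "-c", "3",           # "idle" I/O class; only effective with
                                       # I/O schedulers that honour priorities
        "rdfind",
        "-sleep", f"{sleep_ms}ms",     # pause between reading each file
        "-dryrun", "true" if dry_run else "false",
        "-makehardlinks", "true",      # replace duplicates with hardlinks
    ]
    cmd += list(directories)
    return cmd


if __name__ == "__main__":
    # /srv/backup1 is a placeholder path; replace it with your own mount point.
    print(shlex.join(build_rdfind_command(["/srv/backup1"])))
```

The printed line can be pasted into a shell or a scheduled job once it looks right. Passing `dry_run=False` makes rdfind actually create the hardlinks.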

kwon · 2 February 2022

No worries, I don't drop by here that often, so it's even nicer to see your answer.

I was wondering about the CPU as well, but the problem even occurs on i7 CPUs (sockets 1155 and 1150). Those aren't recent models, granted, but it still seemed odd, and if I remember correctly, the CPU load didn't seem that high the last time I checked either.

But I'll try your suggestion anyway; it's not like I found any other solution in the last year. Well, in one case I actually shut the machine down, took the drive out and let the process run on a different machine a couple of times... not a permanent solution.
