Search files and sort them via the CLI

  • I want to search a folder and its subfolders by file type and move or copy the matching files to another folder (directly, with no subfolders any more). There will be some assumed duplicates which aren't actually duplicates; I would like these to be copied into another folder.


    Finding and copying the files is no problem, but handling the duplicates only works interactively. Because there will be >50k files and around 5k duplicates, a little more automation is needed than I had with this command:


    Code
    find /srv/dev-disk-by-xxx/folder1 -type f -name '*.cvs' -exec cp -i {} /srv/dev-disk-by-xxx/folder2 \;

    The command asks me about every duplicate, but I want them to be copied directly to /srv/disc-by-xxx/folder3. Could this be possible with 2x find and an if-then-else loop?

    Chaos is found in greatest abundance wherever order is being sought.
    It always defeats order, because it is better organized.
    Terry Pratchett
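One way the asked-for if-then-else approach could look. This is only a hedged sketch: the mktemp directories and the sample files here are a throwaway demo standing in for the real /srv/dev-disk-by-xxx folders.

```shell
#!/bin/sh
# Sketch of the two-target idea: copy each found file into dst; if a
# file with that basename is already there, divert the copy to dup
# instead. Temp directories stand in for folder1/folder2/folder3.
src=$(mktemp -d); dst=$(mktemp -d); dup=$(mktemp -d)
printf 'a\n' > "$src/one.cvs"
mkdir "$src/sub"; printf 'b\n' > "$src/sub/one.cvs"   # a name clash
find "$src" -type f -name '*.cvs' | while IFS= read -r f; do
    name=${f##*/}
    if [ -e "$dst/$name" ]; then
        cp "$f" "$dup"/     # assumed duplicate -> "folder3"
    else
        cp "$f" "$dst"/     # first occurrence -> "folder2"
    fi
done
ls "$dst"; ls "$dup"        # one.cvs shows up once in each
```

Note this compares by basename only; which of the two clashing files lands where depends on the order find visits them.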

  • Thank you for your help!


    Code
    cp -i "${REPLY}" /srv/dev-disk-by-xxx/folder3/


    The option -i should not be necessary, or is it? Just in case there is a triplet of files?

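To see why -i (or some other interception) still matters when a name occurs more than twice: without it, cp silently overwrites an existing copy of the same name. A throwaway demo:

```shell
#!/bin/sh
# Two different files sharing one basename, copied into one target:
# without -i the second cp clobbers the first without asking.
t=$(mktemp -d)
mkdir "$t/a" "$t/b" "$t/out"
printf 'first\n'  > "$t/a/file.cvs"
printf 'second\n' > "$t/b/file.cvs"
cp "$t/a/file.cvs" "$t/out"/
cp "$t/b/file.cvs" "$t/out"/   # silently overwrites
cat "$t/out/file.cvs"          # prints: second
```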

    the option -i should not be necessary, or is it? Just in case there is a triplet of files?

    The latter. But I would do it differently and also compare the files' contents. Something like

    Code
    MD5Sum=$(md5sum "${REPLY}" | awk -F" " '{print $1}')

    creates a hash over the contents, so you can do fine-grained comparisons and also sort based on contents.


    I usually do tasks like this as follows:


    Bash
    #!/bin/bash
    # Hash every matching file and write "<md5>TAB<basename>" to a list
    find /srv/dev-disk-by-xxx/folder1 -type f -name '*.cvs' | while read -r ; do
    	MD5Sum=$(md5sum "${REPLY}" | awk '{print $1}')
    	echo -e "${MD5Sum}\t${REPLY##*/}"
    done >/tmp/list.txt

    Then list.txt can be processed further (sorting by column, combined with 'uniq -d', to get only the duplicates listed).

  • Searching for files with the same contents would then be simply this

    Bash
    #!/bin/bash
    awk '{print $1}' </tmp/list.txt | sort | uniq -d | while read -r ; do
    	grep "^${REPLY}" /tmp/list.txt
    done


    Searching for files with the same filenames is then just


    Bash
    #!/bin/bash
    awk '{print $2}' </tmp/list.txt | sort | uniq -d | while read -r ; do
    	grep "${REPLY}$" /tmp/list.txt
    done
  • This list only contains duplicates with same content?


    There are 2 types of duplicates in my folder: duplicates in name but with different content, and real duplicates with the same content but different names. Sorting these out is quite frustrating.


  • This list only contains duplicates with same content?

    The list contains all filenames and all MD5 hashes. The last post above, with the two script snippets, shows how you can use sort/uniq to process the list further and find everything that's interesting. Since nothing is actually moved, you can simply try it out.


    If you want to start postprocessing, the above stuff should be somewhat safe with respect to 'creative filenames' (I do this for a living, crawling through directories on Unix servers where macOS users store their stuff. They use literally every special character possible ;) )


    Finding files with identical names and identical contents is then simply

    Code
    sort </tmp/list.txt | uniq -d
  • Well, hashing all the files seems to take quite a while. I'll stick to your recommendation and post my experiences later on. Thank you very much for your help.


  • hashing all the files seems to take quite a while

    Yeah, it essentially reads the entire contents of every file matching the name pattern, so the find run already serves as some sort of benchmark.


    This variant would provide some user feedback to stdout while writing to list.txt at the same time:


    Bash
    #!/bin/bash
    # Truncate the list first, because tee -a appends
    : >/tmp/list.txt
    find /srv/dev-disk-by-xxx/folder1 -type f -name '*.cvs' | while read -r ; do
    	MD5Sum=$(md5sum "${REPLY}" | awk '{print $1}')
    	echo -e "${MD5Sum}\t${REPLY##*/}" | tee -a /tmp/list.txt
    done

    (but I once dealt with a tee variant that implemented fsync() pretty aggressively, and writing the list file almost became a bottleneck)

  • Bash
    #!/bin/bash
    awk '{print $1}' </tmp/list.txt | sort | uniq -d | while read -r ; do
    	grep "^${REPLY}" /tmp/list.txt
    done

    Works well for files with the same content


    Bash
    #!/bin/bash
    awk '{print $2}' </tmp/list.txt | sort | uniq -d | while read -r ; do
    	grep "${REPLY}$" /tmp/list.txt
    done

    Throws an error about the grep argument.



    By listing the files with the same content, a later process can remove the duplicates .... there are only small variations in naming ...


    The files with the same name can be totally different files; I guess I need to check them manually. But there should only be 100 or 200 of these files ...

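The grep error above most likely comes from filenames that begin with a dash (grep reads them as options) or contain regex metacharacters. A sketch of a variant that sidesteps both by comparing column 2 with awk's literal string equality instead of feeding the name to grep as a pattern (the list here is a made-up demo):

```shell
#!/bin/sh
# Build a throwaway tab-separated list (hash, filename) including an
# awkward filename, then list duplicate filenames without ever using
# the names as grep patterns.
list=$(mktemp)
printf 'aaa\t-odd[name].cvs\nbbb\t-odd[name].cvs\nccc\tplain.cvs\n' > "$list"
awk -F'\t' '{print $2}' "$list" | sort | uniq -d | while IFS= read -r name; do
    awk -F'\t' -v n="$name" '$2 == n' "$list"   # exact match, no regex
done
```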

  • Sometimes you can't see what the right way to do it is ...


    Because I have these two types of duplicates, I did not think of that solution.


    At first my aim was to sort and check the duplicates by name (same "content", but different quality) by hand.


    Same content is clear: get rid of it, that's storage waste.



    Edit: I used fdupes to get rid of content identical duplicates. Thx for that hint.


    Another question: all the files are located in one folder with >20k subfolders sorted in alphabetical order. I want to move all folders (containing files) whose names begin with the letter "A" to another location. I searched the internet for a while, but I could not come up with a solution, since it has to be done via the CLI.



  • Does not work. I want to move all folders beginning with "A" no matter what files are in there.

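A hedged sketch of how moving the "A" folders, regardless of their contents, could look from the CLI. The temp directories and sample names here are only a demo standing in for the real source and target, and mv -t is a GNU coreutils option:

```shell
#!/bin/sh
# Move every direct subdirectory whose name starts with "A" into dest,
# no matter what it contains. -maxdepth 1 stops find from descending,
# -mindepth 1 excludes the source directory itself.
src=$(mktemp -d); dest=$(mktemp -d)
mkdir "$src/Alpha" "$src/Beta" "$src/Apple"
printf 'x\n' > "$src/Alpha/song.cvs"   # contents are irrelevant
find "$src" -mindepth 1 -maxdepth 1 -type d -name 'A*' \
    -exec mv -t "$dest" {} +
ls "$dest"   # Alpha and Apple; Beta stays behind in src
```

With case-insensitive matching, -name 'A*' would become -iname 'a*'.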

  • I do not need it often, but it is quite good to know, so no waste of time.


  • chente

    has closed the thread.
