Search files and sort them via the CLI

  • I want to search a folder and its subfolders by file type and move or copy the matching files to another folder (directly, with no subfolders any more). There will be some assumed duplicates which aren't actually duplicates; I would like these to be copied into another folder.


    Finding and copying the files is no problem, but handling the duplicates only works interactively. Because there will be >50k files and around 5k duplicates, a little more automation is needed than I had with this command:


    Code
    find /srv/dev-disk-by-xxx/folder1 -type f -name '*.cvs' -exec cp -i {} /srv/dev-disk-by-xxx/folder2 \;

    The command asks me about every duplicate, but I want them to be copied directly to /srv/disc-by-xxx/folder3. Could this be possible with 2x find and an if-then-else loop?

    Chaos is found in greatest abundance wherever order is being sought.
    It always defeats order, because it is better organized.
    Terry Pratchett
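One way the asked-for if-then-else approach could look. This is only a hedged sketch: the mktemp directories and the sample files here are a throwaway demo standing in for the real /srv/dev-disk-by-xxx folders.

```shell
#!/bin/sh
# Sketch of the two-target idea: copy each found file into dst; if a
# file with that basename is already there, divert the copy to dup
# instead. Temp directories stand in for folder1/folder2/folder3.
src=$(mktemp -d); dst=$(mktemp -d); dup=$(mktemp -d)
printf 'a\n' > "$src/one.cvs"
mkdir "$src/sub"; printf 'b\n' > "$src/sub/one.cvs"   # a name clash
find "$src" -type f -name '*.cvs' | while IFS= read -r f; do
    name=${f##*/}
    if [ -e "$dst/$name" ]; then
        cp "$f" "$dup"/     # assumed duplicate -> "folder3"
    else
        cp "$f" "$dst"/     # first occurrence -> "folder2"
    fi
done
ls "$dst"; ls "$dup"        # one.cvs shows up once in each
```

Note this compares by basename only; which of the two clashing files lands where depends on the order find visits them.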

  • Thank you for your help!


    Code
    cp -i "${REPLY}" /srv/dev-disk-by-xxx/folder3/


    The option -i should not be necessary, or is it? Just in case there is a triplet of files?

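To see why -i (or some other interception) still matters when a name occurs more than twice: without it, cp silently overwrites an existing copy of the same name. A throwaway demo:

```shell
#!/bin/sh
# Two different files sharing one basename, copied into one target:
# without -i the second cp clobbers the first without asking.
t=$(mktemp -d)
mkdir "$t/a" "$t/b" "$t/out"
printf 'first\n'  > "$t/a/file.cvs"
printf 'second\n' > "$t/b/file.cvs"
cp "$t/a/file.cvs" "$t/out"/
cp "$t/b/file.cvs" "$t/out"/   # silently overwrites
cat "$t/out/file.cvs"          # prints: second
```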

    the option -i should not be necessary, or is it? Just in case there is a triplet of files?

    The latter. But I would do it differently and also compare the files' contents. Something like

    Code
    MD5Sum=$(md5sum "${REPLY}" | awk -F" " '{print $1}')

    creates a hash over the contents, so you can do fine-grained comparisons and also sort based on contents.


    I usually do tasks like this as follows:


    Bash
    #!/bin/bash
    # Hash every matching file and write "<md5>TAB<basename>" to a list
    find /srv/dev-disk-by-xxx/folder1 -type f -name '*.cvs' | while read -r ; do
    	MD5Sum=$(md5sum "${REPLY}" | awk '{print $1}')
    	echo -e "${MD5Sum}\t${REPLY##*/}"
    done >/tmp/list.txt

    Then list.txt can be processed further (sorting by column, combined with 'uniq -d', to get only the duplicates listed).

  • Searching for files with the same contents would then be simply this

    Bash
    #!/bin/bash
    awk '{print $1}' </tmp/list.txt | sort | uniq -d | while read -r ; do
    	grep "^${REPLY}" /tmp/list.txt
    done


    Searching for files with the same filenames is then just


    Bash
    #!/bin/bash
    awk '{print $2}' </tmp/list.txt | sort | uniq -d | while read -r ; do
    	grep "${REPLY}$" /tmp/list.txt
    done
  • This list only contains duplicates with same content?


    There are 2 types of duplicates in my folder: duplicates in name but with different content, and real duplicates with the same content but different names. Sorting these out is quite frustrating.


  • This list only contains duplicates with same content?

    The list contains all filenames and all MD5 hashes. The last post above, with the two script snippets, shows how you can use sort/uniq to process the list further and find everything that's interesting. Since nothing is actually moved, you can simply try it out.


    If you want to start postprocessing, the above stuff should be somewhat safe with respect to 'creative filenames' (I do this for a living, crawling through directories on Unix servers where macOS users store their stuff. They use literally every special character possible ;) )


    Finding files with identical names and identical contents is then simply

    Code
    sort </tmp/list.txt | uniq -d
  • Well, hashing all the files seems to take quite a while. I'll stick to your recommendation and post my experiences later on. Thank you very much for your help.


  • hashing all the files seems to take quite a while

    Yeah, it essentially reads the entire contents of every file matching the name pattern, so the find run already serves as some sort of benchmark.


    This variant would provide some user feedback to stdout while writing to list.txt at the same time:


    Bash
    #!/bin/bash
    # Truncate the list first, because tee -a appends
    : >/tmp/list.txt
    find /srv/dev-disk-by-xxx/folder1 -type f -name '*.cvs' | while read -r ; do
    	MD5Sum=$(md5sum "${REPLY}" | awk '{print $1}')
    	echo -e "${MD5Sum}\t${REPLY##*/}" | tee -a /tmp/list.txt
    done

    (but I once dealt with a tee variant that implemented fsync() pretty aggressively, and writing the list file almost became a bottleneck)

  • Bash
    #!/bin/bash
    awk '{print $1}' </tmp/list.txt | sort | uniq -d | while read -r ; do
    	grep "^${REPLY}" /tmp/list.txt
    done

    Works well for files with the same content


    Bash
    #!/bin/bash
    awk '{print $2}' </tmp/list.txt | sort | uniq -d | while read -r ; do
    	grep "${REPLY}$" /tmp/list.txt
    done

    Throws an error about the grep argument.



    By listing the files with the same content, a later process can remove the duplicates .... there are only small variations in naming ...


    The files with the same name can be totally different files; I guess I need to check them manually. But there should only be 100 or 200 of these files ...

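The grep error above most likely comes from filenames that begin with a dash (grep reads them as options) or contain regex metacharacters. A sketch of a variant that sidesteps both by comparing column 2 with awk's literal string equality instead of feeding the name to grep as a pattern (the list here is a made-up demo):

```shell
#!/bin/sh
# Build a throwaway tab-separated list (hash, filename) including an
# awkward filename, then list duplicate filenames without ever using
# the names as grep patterns.
list=$(mktemp)
printf 'aaa\t-odd[name].cvs\nbbb\t-odd[name].cvs\nccc\tplain.cvs\n' > "$list"
awk -F'\t' '{print $2}' "$list" | sort | uniq -d | while IFS= read -r name; do
    awk -F'\t' -v n="$name" '$2 == n' "$list"   # exact match, no regex
done
```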

  • Sometimes you can't see what the right way to do it is ...


    Because I have these two types of duplicates, I did not think of that solution.


    At first my aim was to sort and check the duplicates by name (same "content", but different quality) by hand.


    Same content is clear: get rid of it, that's storage waste.



    Edit: I used fdupes to get rid of content identical duplicates. Thx for that hint.


    Another question: all the files are located in one folder with >20k subfolders sorted in alphabetical order. I want to move all folders (containing files) whose names begin with the letter "A" to another location. I searched the internet for a while, but I could not come up with a solution, since it has to be done via the CLI.



  • Does not work. I want to move all folders beginning with "A" no matter what files are in there.

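A hedged sketch of how moving the "A" folders, regardless of their contents, could look from the CLI. The temp directories and sample names here are only a demo standing in for the real source and target, and mv -t is a GNU coreutils option:

```shell
#!/bin/sh
# Move every direct subdirectory whose name starts with "A" into dest,
# no matter what it contains. -maxdepth 1 stops find from descending,
# -mindepth 1 excludes the source directory itself.
src=$(mktemp -d); dest=$(mktemp -d)
mkdir "$src/Alpha" "$src/Beta" "$src/Apple"
printf 'x\n' > "$src/Alpha/song.cvs"   # contents are irrelevant
find "$src" -mindepth 1 -maxdepth 1 -type d -name 'A*' \
    -exec mv -t "$dest" {} +
ls "$dest"   # Alpha and Apple; Beta stays behind in src
```

With case-insensitive matching, -name 'A*' would become -iname 'a*'.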

  • I do not need it often, but it is quite good to know, so no waste of time.


  • chente

    has closed the thread.
