Finding a needle in a haystack duplicate

anmac1789 · Post by **anmac1789** » Wed Feb 15, 2023 8:58 pm

Hello, so I've managed to find a lot of duplicates using a custom column:

ancestor:"C:\Users\main name\path1"|ancestor:"E:\Users\different name\path2" files: add-column:column1 column1:=name:"--"formatfiletime($dc:)"--"formatfiletime($dm:)"--"formatfiletime($da:)"--"size: find-dupes:column1

The total number of results I got is 57,160 files: 28,587 files are from the C drive path and 28,573 files are from the E drive path. Shouldn't it find exactly half of the files on each of the C and E drive paths, i.e. 57,160 / 2 = 28,580? It seems there are 28,587 - 28,580 = 7 extra results on the C side which shouldn't be included or are being detected improperly. Similarly, the E drive path is short by 28,580 - 28,573 = 7 results ...

How can I pinpoint what is going on? What are these 7 mysterious files?

raccoon · Post by **raccoon** » Wed Feb 15, 2023 9:28 pm

You can have copies of the same file on the same drive, and they will match unto themselves. There is no specification that the matches must occur across drive volumes, only.

That's why I requested the function Compare-Paths
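
To make that concrete, here is a minimal Python sketch (plain Python, not Everything syntax) of how a duplicate search keyed on a single column can end up with uneven totals. The `(root, key)` record layout, the function name `unbalanced_groups`, and the sample keys are all made up for illustration. It groups entries by key and reports the duplicate groups where the two roots contribute different counts, which is where a surplus like the 7 files above could come from.

```python
from collections import Counter, defaultdict

def unbalanced_groups(records, root_a="C:", root_b="E:"):
    """Return {key: (count_a, count_b)} for duplicate groups with uneven counts.

    records: iterable of (root, key) pairs. This layout is a hypothetical
    stand-in for one row of the result list.
    """
    per_key = defaultdict(Counter)
    for root, key in records:
        per_key[key][root] += 1

    result = {}
    for key, counts in per_key.items():
        if sum(counts.values()) < 2:
            continue  # a single entry is not a duplicate
        a, b = counts[root_a], counts[root_b]
        if a != b:
            result[key] = (a, b)
    return result

sample = [
    ("C:", "a.txt--100"),
    ("E:", "a.txt--100"),
    ("C:", "b.txt--200"),
    ("C:", "b.txt--200"),  # second copy on the same drive
    ("E:", "b.txt--200"),
    ("C:", "c.txt--300"),
    ("C:", "c.txt--300"),  # duplicated only within C:
]
print(unbalanced_groups(sample))
```

Any key this reports is a group where one side has more copies than the other, so those are the entries worth inspecting first.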

therube · Post by **therube** » Wed Feb 15, 2023 9:36 pm

Wouldn't UNIQUE give you what you want?



<c:/tmp | c:/out>  !copy file:  abc

(I didn't throw unique into the search line; instead I right-clicked the column header & chose 'Find ___ Duplicates'.)

I've got trees
I exclude files with the string copy (cause I have 100K of them)
I look for files that contain abc in those two trees

Then I unique: them, by some category; Name, Size, whatever, & I'm left with files that - don't fit, that are unique.

(Now, I've left the ancestor: & whatnot out, but I'd think it should work just as well thrown in.)
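
The "unique" step can be sketched outside of Everything as well. This is only an illustration in Python, assuming each result is a small dict; the helper name `find_unique`, the field names, and the sample paths are hypothetical. It keeps the entries whose chosen property (here name plus size) occurs exactly once, i.e. the ones that "don't fit".

```python
from collections import Counter

def find_unique(entries, key):
    """Return the entries whose key(entry) value occurs exactly once."""
    entries = list(entries)
    counts = Counter(key(e) for e in entries)
    return [e for e in entries if counts[key(e)] == 1]

# Hypothetical results from the two trees.
files = [
    {"path": r"c:\tmp\abc1.txt", "name": "abc1.txt", "size": 10},
    {"path": r"c:\out\abc1.txt", "name": "abc1.txt", "size": 10},
    {"path": r"c:\tmp\abc2.txt", "name": "abc2.txt", "size": 20},
]
leftovers = find_unique(files, key=lambda e: (e["name"], e["size"]))
print([e["path"] for e in leftovers])
```

Swapping the `key` function is the equivalent of picking a different category (Name, Size, whatever) to compare on.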

anmac1789 · Post by **anmac1789** » Wed Feb 15, 2023 10:08 pm

raccoon wrote: ↑Wed Feb 15, 2023 9:28 pm You can have copies of the same file on the same drive, and they will match unto themselves. There is no specification that the matches must occur across drive volumes, only.

That's why I requested the function Compare-Paths

I'm also waiting on this function. It seems highly useful, considering how many duplicate paths there are with similar subfolder names in between.

therube wrote: ↑Wed Feb 15, 2023 9:36 pm Wouldn't UNIQUE give you what you want?

<c:/tmp | c:/out> !copy file: abc

(I didn't throw unique into the search line; instead I right-clicked the column header & chose 'Find ___ Duplicates'.)

I've got trees
I exclude files with the string copy (cause I have 100K of them)
I look for files that contain abc in those two trees

Then I unique: them, by some category; Name, Size, whatever, & I'm left with files that - don't fit, that are unique.

(Now, I've left the ancestor: & whatnot out, but I'd think it should work just as well thrown in.)

Could you be a little more specific in your description? What do you mean by 'string copy'?

I found a hack-and-slash kind of way; I am not sure if it will work for you. I had to re-create some Excel functions and then translate them into Everything using a custom column. Here is the custom column I used:

RIGHT($path:,LEN($path:)-FIND("\Android\",$path:))

I chose this because I had to find the first instance of the Android\ folder inside the path, drop everything to its left, and keep the rest of the path to its right as a partial path. So far, the results I got were:

57,146 objects - 28,573 for paths beginning with C and 28,573 for paths beginning with E. So far so good...
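
For anyone who wants to check what the RIGHT/LEN/FIND expression keeps, here is a rough Python equivalent. It assumes the functions follow Excel's semantics (FIND returns a 1-based position and is case-sensitive), which is what I modelled them on. The function name `tail_from_marker` and the `data\x.txt` part of the sample paths are made up for the example.

```python
def tail_from_marker(path, marker="\\Android\\"):
    """Mimic RIGHT(path, LEN(path) - FIND(marker, path)) with a 1-based FIND.

    Returns everything after the first character of the first match,
    or None when the marker is absent (Excel's FIND would give an error).
    """
    idx = path.find(marker)  # 0-based position, -1 when missing
    if idx < 0:
        return None
    find_pos = idx + 1       # Excel-style 1-based position of the match
    return path[find_pos:]   # the last LEN(path) - find_pos characters

c_path = r"C:\Users\main name\path1\Android\data\x.txt"
e_path = r"E:\Users\different name\path2\Android\data\x.txt"
print(tail_from_marker(c_path))
print(tail_from_marker(e_path))
```

Both calls give the same partial path starting at `Android\`, so the drive letter and user folder no longer take part in the comparison.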

Here is the full custom column I have:

files: add-column:column1 column1:=name:"--"RIGHT($path:,LEN($path:)-FIND("\Android\",$path:))"--"formatfiletime($dc:)"--"formatfiletime($dm:)"--"formatfiletime($da:)"--"size: !find-dupes:column1
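
As a sanity check on why adding the partial path changes the counts, here is a small Python sketch of the key layout. It is only an approximation: the `make_key` helper, the dict fields, and the date strings are hypothetical, and the real output of formatfiletime() will look different. The point is that two same-drive copies in different subfolders share a key without the partial path and stop sharing it once the partial path is part of the key.

```python
def make_key(entry, with_relpath):
    """Join the fields with "--", mirroring the layout of the column1 formula."""
    parts = [entry["name"]]
    if with_relpath:
        parts.append(entry["relpath"])
    parts += [entry["dc"], entry["dm"], entry["da"], str(entry["size"])]
    return "--".join(parts)

# Two hypothetical copies of one file in different subfolders of the same tree.
base = {"name": "x.txt", "dc": "2023-01-01", "dm": "2023-01-02",
        "da": "2023-01-03", "size": 5}
copy_1 = dict(base, relpath=r"Android\data\x.txt")
copy_2 = dict(base, relpath=r"Android\backup\x.txt")

same_without = make_key(copy_1, False) == make_key(copy_2, False)
same_with = make_key(copy_1, True) == make_key(copy_2, True)
print(same_without, same_with)
```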

voidtools forum
