On a Find...Duplicates, is the Find smart?

Post by **therube** » Thu Dec 09, 2021 5:07 pm

On a Find...Duplicates, is the Find "smart"?

As in (with lazyload, rather than Indexed), if you do a 'Find SHA-1 Duplicates', does it calculate the SHA-1 hashes for all files, then display the dups?

Or does it (first) find size dups, & only perform the hash checks on that set of files (as in order for hashes to be the same, size must also be the same, & as size is already indexed...)?

Post by **raccoon** » Thu Dec 09, 2021 8:39 pm

AFAIK, Everything does not do anything special or fancy with hash value calculation nor with hash digest file reading... besides dumping that data into a column for user eyeballs and column sorting, which includes the standard column-duplicates and column-uniques behavior. There is no hash re-calculation for verification, nor size mismatch detection as yet. Just column population as raw data.

Natural hash / filesize collisions are extremely unlikely for any of the 64, 128 (MD5), 160 (SHA-1), or 256+ bit hash digests unless there is a specific cryptographic attacker trying to instigate a collision.

For a 50% probability of a 32 bit hash collision (CRC-32), you'll need only 77163 files on your computer. But for a 50% probability of collision for a 64 bit hash, you'll need to handle roughly 5.06 billion hashed files, or 6.07 million files for a 1-in-a-million chance at a 64 bit collision.[1]

[1] https://preshing.com/20110504/hash-coll ... abilities/

160 bit (SHA-1), 50% collision probability: 1,420,000,000,000,000,000,000,000 files.
160 bit (SHA-1), 1-in-a-million probability: 1,710,000,000,000,000,000,000 files.
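
For anyone who wants to sanity-check those figures, here is a rough sketch using the usual birthday approximation, n ≈ sqrt(2 · N · ln(1 / (1 − p))) with N = 2^bits. The function name `files_for_collision` is made up for this example, and the result is an estimate, not an exact count.

```python
import math

def files_for_collision(bits: int, probability: float) -> float:
    """Approximate number of uniformly random hashes needed before the
    chance of at least one collision reaches `probability`.

    Uses the birthday approximation n ~ sqrt(2 * N * ln(1 / (1 - p))),
    where N = 2**bits is the size of the hash space.
    """
    space = 2.0 ** bits
    # -log1p(-p) equals ln(1 / (1 - p)) and stays accurate for tiny p.
    return math.sqrt(2.0 * space * -math.log1p(-probability))

if __name__ == "__main__":
    for bits in (32, 64, 160):
        half = files_for_collision(bits, 0.5)
        rare = files_for_collision(bits, 1e-6)
        print(f"{bits:>3} bit: 50% at ~{half:.3g} files, "
              f"1-in-a-million at ~{rare:.3g} files")
```

This assumes the hash behaves like a uniformly random function, which is the same assumption behind the table in [1]; it says nothing about deliberately constructed collisions.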

Post by **therube** » Thu Dec 09, 2021 9:27 pm

I have a pair of files that I keep because they are a collision pair.
(Though collision isn't the question, here.)

raccoon wrote: Everything does not do anything special or fancy with hash value calculation

Well, if the data (hash) is not indexed, then it does need to calculate it.

If it calculates on the entire data set, & then presents the duplicates, then it is not doing it efficiently.

Because, if files are not of the same size, then their hashes will not match. So if you do a 'Find SHA-1 Duplicates' in the case where no sizes are dup'd, Everything can immediately return an empty file list. And for the case where there are size dup's, it is only those files that the hash needs to be calculated for, to verify whether 'Find SHA-1 Duplicates' should display them or not.
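
To make the size-first idea concrete, here is a minimal sketch in Python. It is not how Everything works internally (that is the open question in this thread); it only illustrates the approach described above. The helper names `sha1_of` and `find_sha1_duplicates` are invented for this example.

```python
import hashlib
import os
from collections import defaultdict

def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_sha1_duplicates(paths):
    """Group files by size first, then hash only the size duplicates.

    Returns a list of groups; each group is a list of paths that share
    both size and SHA-1.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, same_size in by_size.items():
        if len(same_size) < 2:
            continue  # unique size: cannot be a duplicate, skip hashing
        for path in same_size:
            by_hash[(size, sha1_of(path))].append(path)

    return [group for group in by_hash.values() if len(group) > 1]
```

If no two files share a size, the second loop never hashes anything and the function returns an empty list straight away, which is the shortcut being asked about.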

(Again, I'm speaking more so about "lazyload" rather than indexed data.)

(As it is, I was running some "tests", & Everything was calculating SHA-1's slower than, say, AllDup - on the same data set.
9:33 vs 8:25, but I wasn't sure if Everything was actually only checking size dup's or not.)

Post by **raccoon** » Thu Dec 09, 2021 9:36 pm

Everything does All-Or-Nothing for all Index Properties, based on the selected filesize, folder and file extension parameters in your Properties settings. They are calculated immediately when you hit the Apply button, and do not wait for you to perform any syntax searching or column sorting. It's the same behavior as indexing Media metadata. It is unclear exactly which properties Everything supports "on-the-fly" by simply adding a column.

I would strongly recommend against daily driving any of the hash calculating Properties and instead utilize the [hash]sum Properties that attempt to locate pre-calculated .md5 or .sha1 or .sha256 digest files. @void has indicated that all of your hard work calculating sums in Everything will be wiped and destroyed whenever the Index needs to be rebuilt, which can happen at any time.

I estimate sometime in the distant future, @void will allow for dumping Generated Hashes into scattered Digest Files or maybe :AltStreams.

Post by **void** » Fri Dec 10, 2021 4:23 am

therube wrote: On a Find...Duplicates, is the Find "smart"?

No.

Please try right-clicking the Size column and clicking Find Size duplicates.
Right click the result list column header and click Add columns.... -> show the SHA-1 column.
Right click the SHA-1 column and click Find SHA-1 duplicates.

This way you only load the SHA-1 sums for files duplicated by size.

raccoon wrote: I would strongly recommend against daily driving any of the hash calculating Properties and instead utilize the [hash]sum Properties that attempt to locate pre-calculated .md5 or .sha1 or .sha256 digest files. @void has indicated that all of your hard work calculating sums in Everything will be wiped and destroyed whenever the Index needs to be rebuilt, which can happen at any time.

I agree.
I highly recommend using sha256sum .sha256 files for storing hashes.
Once Everything is in beta/release the database should be stable enough to hold hashes.
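
As a rough illustration of keeping hashes in digest files rather than in the index, the sketch below writes a sidecar `.sha256` file next to a given file, using the `<hex> *<name>` line format that GNU `sha256sum` produces in binary mode. The function name `write_sha256_sidecar` and the `<filename>.sha256` naming are assumptions made for this example; check which digest file names your Everything version actually looks for.

```python
import hashlib
from pathlib import Path

def write_sha256_sidecar(path):
    """Hash `path` with SHA-256 and write `<name>.sha256` beside it.

    The sidecar holds one line: "<hex digest> *<file name>", which is
    the binary-mode line format used by GNU sha256sum.
    """
    target = Path(path)
    digest = hashlib.sha256()
    with target.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    # Assumed naming convention: the original file name plus ".sha256".
    sidecar = target.with_name(target.name + ".sha256")
    sidecar.write_text(f"{digest.hexdigest()} *{target.name}\n",
                       encoding="utf-8")
    return sidecar
```

Because the digest lives in an ordinary file, it survives an index rebuild, which is the point of the recommendation above.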

Post by **raccoon** » Fri Dec 10, 2021 5:17 am

void wrote: ↑Fri Dec 10, 2021 4:23 am
Please try right-clicking the Size column and clicking Find Size duplicates.
Right click the result list column header and click Add columns.... -> show the SHA-1 column.
Right click the SHA-1 column and click Find SHA-1 duplicates.

So it *is* indeed possible to do SHA-1 hashing "on-the-fly" by only adding the SHA-1 column once we have refined our search results and size-duplicates, etc.

I was genuinely unaware that hash generation and property scraping could be done on-the-fly like this. Are all properties supported in this way without pre-indexing them from the Options -> Indexes -> Properties settings panel?

Post by **void** » Fri Dec 10, 2021 5:20 am

raccoon wrote: I was genuinely unaware that hash generation and property scraping could be done on-the-fly like this. Are all properties supported in this way without pre-indexing them from the Options -> Indexes -> Properties settings panel?

Yes, of course.

Post by **therube** » Fri Dec 10, 2021 3:57 pm

therube wrote: (As it is, I was running some "tests", & Everything was calculating SHA-1's slower than, say, AllDup - on the same data set.
9:33 vs 8:25, but I wasn't sure if Everything was actually only checking size dup's or not.)

void wrote: No.

In that case, Everything would seem to be very efficient at calculating the hashes (considering it was calculating on a much greater data set).

void wrote: Please try right-clicking the Size column and clicking Find Size duplicates.
Right click the result list column header and click Add columns.... -> show the SHA-1 column.
Right click the SHA-1 column and click Find SHA-1 duplicates.

Situations like that are precisely when the suggested "column visible, but disabled" would be handy.
That way, you could have SHA-1 there, but not affecting anything, until you specifically enabled it (only to disable it again, after the need, & all the while, also respecting lazyload).

raccoon wrote: unaware that hash generation and property scraping could be done on-the-fly like this

Even better is that if your columns are laid out such that particular Properties are "out of focus" (not visible in the current viewport), then they aren't loaded at all. So you could have the Length or SHA-1 columns enabled, but until they are brought into focus, their data is not gathered. (Arrow key or mouse over so the column is brought into focus, & [only] the visible rows for that column are then loaded.)

Post by **raccoon** » Mon Dec 20, 2021 5:29 pm

To make the most of this method of SHA dupe detection with many files, it is important to invoke the following option so that Everything calculates hashes for off-screen rows, because by default Everything will only calculate hashes for rows that are visible on-screen (what therube calls "lazy load").

/request_extra_fileinfo_end=10000

This will load column metadata, thumbnails and calculate hash sums for all search results, even those that are off-screen. Make the value larger than 10000 if you expect to be working with more than 10,000 search results. Make the value smaller again, or 0, after you're done.

Without this option, you will have to manually scroll down to look at every row before Everything will calculate the file's hash.

Ref: Is there a way to force preload Thumbnails?

Post by **therube** » Mon Dec 20, 2021 6:27 pm

raccoon wrote: Without this option, you will have to manually scroll down to look at every row before Everything will calculate the file's hash.

Without scrolling down, if you click the column header (SHA-1), that too will cause the entire column to load.

Post by **raccoon** » Mon Dec 20, 2021 8:12 pm

therube wrote: ↑Mon Dec 20, 2021 6:27 pm
Without this option, you will have to manually scroll down to look at every row before Everything will calculate the file's hash.
Without scrolling down, if you click the column header (SHA-1), that too will cause the entire column to load.

I think that only changes your sort order from ascending to descending. Where in the column header do you have to click to make it load the entire column?

Post by **therube** » Mon Dec 20, 2021 8:20 pm

For (I guess it is) a column that is not already Indexed, clicking the column header will both change the sort to that column & also gather the "unreturned" (unloaded) data for it (based on the current results list).
