Setting to remove duplicated results from index

Post by **sugoro** » Tue Jun 20, 2017 10:00 pm

Use-case: when remapping paths for software like DrivePool (viewtopic.php?t=1572 and viewtopic.php?f=4&p=17171), we will often have "duplicated" results in the database. I say "duplicated" because they are not technically duplicated entries, as they belong to different drives. But, with remapping, they will map to exact duplicates in the db.

For example, we could have two drives mounted to folders Drive1 and Drive2, in a "mirror" configuration, where File.txt is duplicated to both drives, like so:
C:\Drive1\File.txt
C:\Drive2\File.txt

We then remap those, to point to the actual pooled drive, say, at D:\
Then, File.txt is accessed with D:\File.txt

In the db, we'll have D:\File.txt twice.

This setting toggle would remove the duplicates (possibly after sorting) and the db would not contain any exact duplicate entries.
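To make the request concrete, here is a minimal Python sketch of the scenario described above. It is only an illustration, not how Everything stores or remaps its index: the `REMAP` table, `remap`, `dedupe_sorted`, and the extra `Only.txt` file are hypothetical names introduced for the example.

```python
from itertools import groupby

# Hypothetical remap table: indexed mount folder -> pooled drive root.
REMAP = {
    "C:\\Drive1\\": "D:\\",
    "C:\\Drive2\\": "D:\\",
}

def remap(path: str) -> str:
    """Rewrite a physical path to its pooled path; leave other paths alone."""
    for prefix, target in REMAP.items():
        if path.lower().startswith(prefix.lower()):
            return target + path[len(prefix):]
    return path

def dedupe_sorted(paths):
    """Sort, then keep one entry per run of identical paths."""
    return [key for key, _ in groupby(sorted(paths))]

indexed = [
    "C:\\Drive1\\File.txt",
    "C:\\Drive2\\File.txt",   # mirror copy of the same file
    "C:\\Drive1\\Only.txt",   # hypothetical file that is not duplicated
]
remapped = [remap(p) for p in indexed]

print(remapped)                  # D:\File.txt is listed twice
print(dedupe_sorted(remapped))   # each pooled path is listed once
```

Sorting first means exact duplicates end up next to each other, so a single pass is enough to drop them.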

Thanks for reading!

Post by **void** » Wed Jun 21, 2017 6:54 am

Is excluding one of the mirror drives possible? eg: C:\Drive2

To exclude a folder in Everything:

In Everything, from the Tools menu, click Options.
Click the Exclude tab on the left.
Click Add Folder....
Select C:\Drive2 and click OK.
Click OK.

Post by **sugoro** » Wed Jun 21, 2017 11:11 am

void wrote: Is excluding one of the mirror drives possible? eg: C:\Drive2

To exclude a folder in Everything:
In Everything, from the Tools menu, click Options.
Click the Exclude tab on the left.
Click Add Folder....
Select C:\Drive2 and click OK.
Click OK.

Yes, for simple cases. It won't work very well for more complicated duplication scenarios. I have specific duplication rules, to maximize space (no point duplicating backups that are already stored in another location, offsite).
Some folders are in 4 drives, others in 3, others in 2.

Also, you set rules like "keep 3 copies of this folder's contents" but you usually don't tell the program to "keep those folders in those 3 drives". It will place the files in whichever drive it determines to be the best, and files can be moved to other drives during its balancing routine.

Because of this, there's no way to ignore "this folder, on these drives, except this one", since parts of a folder will live on different drives, depending on how many drives you have in the pool and your duplication/placement rules.
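A small Python sketch of why drive exclusion breaks down here. The `placement` table is made up for illustration (three files in one folder, two copies each, placed on whichever drives the pooling software happened to pick), and `copies_shown` is a hypothetical helper. The code simply tries every possible set of excluded drives.

```python
from itertools import combinations

# Hypothetical placement chosen by the pooling software: file -> drives holding a copy.
placement = {
    "Photos\\a.jpg": {"Drive1", "Drive2"},
    "Photos\\b.jpg": {"Drive2", "Drive3"},
    "Photos\\c.jpg": {"Drive1", "Drive3"},
}

def copies_shown(placement, excluded):
    """How many times each file appears in results when `excluded` drives are not indexed."""
    return {name: len(drives - excluded) for name, drives in placement.items()}

all_drives = sorted(set().union(*placement.values()))

# Try every subset of drives as the exclusion list and keep the ones
# that leave every file visible exactly once.
workable = [
    set(combo)
    for r in range(len(all_drives) + 1)
    for combo in combinations(all_drives, r)
    if all(n == 1 for n in copies_shown(placement, set(combo)).values())
]

print(copies_shown(placement, {"Drive2"}))  # c.jpg still shows twice
print(workable)                             # [] : no exclusion list works
```

With this placement, any exclusion list either leaves a duplicate behind or hides a file completely, which is the problem described above.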

Thanks for the reply!

Post by **dlong500** » Mon Sep 14, 2020 7:53 pm

@void Adding a feature to hide duplicate full paths would be extremely useful in a complex configuration using pooling software like DrivePool. Excluding specific disks won't help because DrivePool handles its own duplication algorithms (disks aren't simple mirrors). But it seems like it should be fairly simple to track duplicated index entries in such a scenario, because the full path, size, and date will be exactly the same for duplicate files on drives that have been mapped to a virtual pooled drive.

For example, let's say we have drive P: and drive Q: representing volumes on physical disks, and we remap both of those to a virtual drive X:

If we have a file (test.txt) that exists on:
P:\PoolPart.xxx\test.txt
Q:\PoolPart.xxx\test.txt

the Everything index will show:
X:\test.txt
X:\test.txt

Couldn't there be a way to detect a duplicated index entry so we could hide one (or more) of the same rows in the GUI?
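As a rough illustration of the idea, the sketch below treats two rows as duplicates when their full path, size, and date all match. The `Entry` type, `hide_duplicate_rows`, the sizes, and the timestamps are hypothetical; this is not Everything's data model. Lowercasing the path is an assumption that paths compare case-insensitively.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    full_path: str   # path after remapping to the virtual drive
    size: int        # size in bytes
    modified: str    # date modified as an ISO string (illustrative values)

def hide_duplicate_rows(entries):
    """Keep the first row for each (full path, size, date) combination."""
    first_seen = {}
    for entry in entries:
        # Assumption: paths are compared case-insensitively.
        key = (entry.full_path.lower(), entry.size, entry.modified)
        first_seen.setdefault(key, entry)   # later identical rows are ignored
    return list(first_seen.values())

# P:\PoolPart.xxx\test.txt and Q:\PoolPart.xxx\test.txt both remap to X:\test.txt
index = [
    Entry("X:\\test.txt", 1024, "2020-09-14T19:53:00"),
    Entry("X:\\test.txt", 1024, "2020-09-14T19:53:00"),
    Entry("X:\\other.txt", 2048, "2020-09-01T08:00:00"),
]

visible = hide_duplicate_rows(index)
print([e.full_path for e in visible])   # X:\test.txt appears only once
```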

Post by **dlong500** » Wed Aug 18, 2021 10:14 pm

Some of these posts are obscuring the original issue. The point of this thread is using Everything with storage pooling software like DrivePool. The FAQ covering duplicated results doesn't address the issues with pooling software, and while using a folder index can technically be considered a workaround, it pretty much defeats the purpose of using Everything since you lose fast NTFS indexing. Remapping the volumes works to aggregate the separate drive indexes so that Everything correctly shows the virtual pooled drive, so that part is working great, but of course it shows multiple entries for the same file when there are redundant copies in the storage pool.

If a feature could be added to deduplicate the index itself when using remapped volumes, that would fix the problem entirely and we wouldn't need to sacrifice the speed of NTFS journaling.

Post by **void** » Thu Aug 19, 2021 4:51 am

If Everything is showing duplicated results for a single drive, please see: Duplicated results.

Moved unrelated posts here: Duplicated results

Post by **void** » Thu Aug 19, 2021 5:04 am

I have recently added a distinct: search function to Everything 1.5.

Please try including distinct: sort:"full path" in your search.

The distinct: search will list only unique files based on the current sort (with the sort set to full path, it removes duplicated full paths from the results).

It is important to specify the sort with distinct:
You can change the sort after searching. Any duplicated results will remain removed.
Double click DUPE in the status bar to clear the distinct: search.

There is a performance hit with sorting by full path, so combine distinct: with other search parameters for the best performance.

To improve full path sorting performance:

In Everything, from the Tools menu, click Options.
Click the Properties tab on the left.
Click Add....
Select Full Path and click OK.
Check Fast sort.
Click OK.

Please let me know if this search helps.

Post by **dlong500** » Sat Aug 28, 2021 4:18 am

@void Thanks so much for addressing this issue!

Adding distinct: in front of any search I make DOES seem to resolve my issue with using remapped drives (in the context of using DrivePool with the custom parameters specified in this thread). I see only one line in the index for each file in a DrivePool drive even when there is redundancy specified within DrivePool settings.

However, that appears to be ALL that is necessary. I don't need to use any path sorting, and the performance doesn't seem to suffer either. But if I add a new file within a DrivePool drive that has redundancy/duplication after the search then I see duplicates for the new file in the list. If I double click on "DUPE" in the status bar the duplicate listing for the newly created file goes away too.

Could there be any way to add an option for a permanent "distinct" setting? And also the ability for the distinct option to apply in realtime for any new additions to the index? If that were possible I think everything would work perfectly under my scenario.

I'm happy to provide more detail or feedback if you want, or to do testing on any new builds.

Thanks again for all that you do!

Post by **void** » Wed Sep 01, 2021 9:21 am

Consider adding distinct: sort:full-path to your Everything filter:

In Everything, from the Search menu, click Organize filters....
Select Everything and click Edit....
Change the Search to:
distinct: sort:full-path
Click OK.
Click OK.

Or, consider adding a new filter:

In Everything, from the Search menu, click Add to filters....
Change the Name to:
Distinct
Change the Search to:
distinct: sort:full-path
Click OK.

Filters can be activated from the Search menu, Filter bar (View -> Filters), right clicking the status bar, filter macro or filter keyboard shortcut.

Post by **dlong500** » Sat Sep 04, 2021 12:40 am

@void, adding distinct to the base filter certainly improves on having to enter it each time, but it also makes the filtering system more complex (every filter would need to include the distinct parameter in addition to whatever other parameters are specified). That's a minor gripe, but still something to consider.

Of more importance is the issue of the distinct parameter only working as a snapshot and not in real time. Any new files that match the search will still show up duplicated on pooled storage with redundant file copies. The search has to be manually refreshed each time new files are created to get new duplicate results to go away. If there could be a way to make distinct operate on any newly indexed results (when monitor changes is active) in addition to the initial snapshot that would resolve the issue.

I guess the bigger issue for me is wondering why anyone would ever want a duplicated result to show up in the index at all in the context of pooled storage with remapped NTFS drives pointing to a single pooled virtual drive. What use case would there be to show completely duplicate lines in the result list? I certainly understand that many people who don't have a complicated storage situation wouldn't want the performance hit of forcing a dedupe operation, but for situations like mine it would greatly reduce the complexity to simply have a single "distinct" option in the Indexes > NTFS section for each physical drive such that it would deduplicate results in real time. This would eliminate the need to mess with filters at all for deduplication and keep the filtering more simplified for other "real" filtering choices. It should of course be disabled by default to avoid deduplication in more simple scenarios when there is no need.

Post by **void** » Wed Sep 08, 2021 11:24 am

Thanks for the reply dlong500,

distinct: sort:fullpath is not the best option for deduping pooled storage.
A better solution is needed.

distinct: is not real-time.
There would be a large performance hit for Everything to re-check the distinct state for all duplicates on every single file change.
I will consider adding an option to do this.
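For readers wondering what a real-time variant would involve, here is one possible shape, sketched in Python. It is purely hypothetical and says nothing about how Everything works internally: `LiveDistinctView` and its event methods are invented names. The idea is to count the physical copies behind each remapped path and to touch the visible list only when a count moves between 0 and 1.

```python
from collections import Counter

class LiveDistinctView:
    """Hypothetical result list that stays deduplicated as index events arrive."""

    def __init__(self):
        self._copies = Counter()   # remapped path (lowercased) -> physical copies
        self.rows = []             # visible rows, one per distinct path

    def on_created(self, path: str) -> None:
        key = path.lower()
        self._copies[key] += 1
        if self._copies[key] == 1:       # first copy: show the row
            self.rows.append(path)

    def on_deleted(self, path: str) -> None:
        key = path.lower()
        if self._copies[key] == 0:       # unknown path, nothing to do
            return
        self._copies[key] -= 1
        if self._copies[key] == 0:       # last copy gone: hide the row
            del self._copies[key]
            self.rows = [r for r in self.rows if r.lower() != key]

view = LiveDistinctView()
view.on_created("X:\\test.txt")   # copy written to the first physical drive
view.on_created("X:\\test.txt")   # redundant copy written to a second drive
print(view.rows)                  # the new file is listed once
```

The trade-off in this sketch is extra bookkeeping on every file change plus one counter per distinct path, which is the kind of cost the post above is concerned about.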

Post by **dlong500** » Sat Jun 10, 2023 7:08 pm

void wrote (Wed Sep 08, 2021 11:24 am):
distinct: sort:fullpath is not the best option for deduping pooled storage.
A better solution is needed.

distinct: is not real-time.
There would be a large performance hit for Everything to re-check the distinct state for all duplicates on every single file change.
I will consider adding an option to do this.

Just pinging this thread again to see if you've thought any more about a better solution for deduping pooled storage. The app is still useful to me even with duplicate results, but they certainly clutter up the interface and make it harder to use. I would love some type of real-time optional index dedupe (optional so any performance penalty wouldn't be forced onto users who don't want to use it).

Post by **void** » Sat Jun 10, 2023 10:03 pm

The Everything Server will now dedupe filenames.

Post by **dlong500** » Mon Jul 17, 2023 5:05 pm

void wrote (Sat Jun 10, 2023 10:03 pm):
The Everything Server will now dedupe filenames.

Just following up here to say thanks. I've been testing out v1.5 alpha with Everything Server for a few weeks now to dedupe file paths on a DrivePool volume and it has been working great! The configuration to get everything set up is a bit complex, but it does work well.

Post by **klepp0906** » Sun Apr 28, 2024 12:00 pm

bumping this old thread as opposed to creating a new one, since this was the information I was working off of when attempting to troubleshoot the issue (duplicate entries from pooled storage)

the other day I found myself searching for a particular folder and it wasn't turning up. this was after having created a filter with distinct: to work around the issue.

i assumed/it appeared to be working great, however the fact that i'm now seeing the following leads me to believe this is going to happen all over the place. hoping you can recognize the root cause of the issue and offer a solution if one exists.

here I am searching for a particular folder called "Assets"

It is nested under E:\Emulation\Media\General\Unsorted

My plan was to search the E: drive, which is made up of about a dozen drives pooled with DrivePool. If I do so with distinct: added to my filter, it displays a single "Assets" folder and it's not the one I'm after.

[Screenshot: 2024-04-28_07-53-39.PNG]

If I remove distinct: from the filter, then it shows up, along with its duplicate, all the other "Assets" folders (desirable), and their duplicates (undesirable).

[Screenshot: 2024-04-28_07-51-37.PNG]

It seems distinct: is applying to the name and name only and not distinguishing based on the virtual path or anything else that could differentiate one "Assets" from another non dupe version.

As I said at the beginning, if it's happening here it's happening everywhere, so I want to get this sorted for obvious reasons.

Post by **void** » Sun Apr 28, 2024 12:04 pm

distinct: will find distinct names.

Please try the following search to find distinct full paths:

distinct:fullpath
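The difference between the two searches can be pictured with a short Python sketch. It only models the idea of choosing a different key for "distinct"; the `distinct` helper and the `E:\Other\Assets` folder are hypothetical and this is not Everything's implementation.

```python
import ntpath

def distinct(rows, key):
    """Keep the first row for each distinct key value."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

def name_key(path):
    # Key on the last path component only, ignoring the parent folders.
    return ntpath.basename(path).lower()

def full_path_key(path):
    return path.lower()

results = [
    "E:\\Other\\Assets",                                 # hypothetical unrelated folder
    "E:\\Emulation\\Media\\General\\Unsorted\\Assets",
    "E:\\Emulation\\Media\\General\\Unsorted\\Assets",   # pooled duplicate
]

print(distinct(results, name_key))        # one "Assets" row, and not the wanted one
print(distinct(results, full_path_key))   # each folder once, pooled duplicate removed
```

Keying on the name collapses every "Assets" folder into a single row, while keying on the full path removes only the pooled duplicates.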

Post by **klepp0906** » Sun Apr 28, 2024 12:15 pm

void wrote (Sun Apr 28, 2024 12:04 pm):
distinct: will find distinct names.

Please try the following search to find distinct full paths:

distinct:fullpath

it's funny, I posted this knowing you'd reply and being all but entirely confident you'd have a solution. (didn't expect it to be so fast though!) it must be morning coffee time

what I'm trying to say is I (we all) appreciate what you do. whenever I come across a post or site or survey etc inquiring about "essential" or "must-have" software, this is always the first to come to mind.

it has enabled me to do such a broad array of work that would not have been possible without it. heck, it's more powerful than I can even wrap my brain around as far as potential use-cases go.

so thanks for the reply which bore fruit, and thanks for developing it!

I now understand the filtering better. I was going to note that for my use case distinct:fullpath was going to be the preferable option as a prefix for all the built-in filters. That should have it functioning the way I want and allow me to use the checkboxes should I want to narrow down case etc. (seems you realized this and edited your post to remove the case: portion though!)

do you take donations anywhere?

Post by **void** » Sun Apr 28, 2024 12:20 pm

case: isn't really needed.

Without case:, you may inadvertently remove some paths, eg:

c:\music\Röyksopp
c:\music\royksopp

Use case:distinct:fullpath if you want these both to show up.
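A hedged Python sketch of the effect described above. It assumes that the default comparison ignores both case and diacritics, which is what the example suggests but is not confirmed in this thread, and that case: only turns the case folding off. `distinct_key` is a hypothetical helper, not part of Everything.

```python
import unicodedata

def distinct_key(path: str, match_case: bool) -> str:
    """Key used to decide whether two paths count as the same.

    Assumption: diacritics are always ignored, and case is ignored
    unless match_case is True (modelling the case: modifier).
    """
    decomposed = unicodedata.normalize("NFKD", path)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks if match_case else without_marks.casefold()

paths = [
    "c:\\music\\R\u00f6yksopp",   # \u00f6 is the letter o with a diaeresis
    "c:\\music\\royksopp",
]

without_case = {distinct_key(p, match_case=False) for p in paths}
with_case = {distinct_key(p, match_case=True) for p in paths}

print(len(without_case))   # the two folders collapse into one key
print(len(with_case))      # the keys differ (R vs r), so both folders are kept
```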

https://www.voidtools.com/donate/
Thank you for your support.

voidtools forum
