Duplicate files in 1.5.2

Today when I was searching in one of my libraries I saw hundreds of duplicate files that previously were not there. I am up to over 100 deleted. These files have somewhat common names, e.g. “Learn2Grow Article Template”, which is not the true name of the file. The Froms are common too. In virtually all cases so far, the expected file is in the library along with the strange duplicate versions. Any ideas on what is causing this and how to deduplicate other than going through thousands of files?

Thanks
Frank

The EagleFiler definition of duplicate files is when the contents (data fork) are identical. Which criteria are you using to call these files duplicates?
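To make that definition concrete, here is a minimal sketch of a content-only comparison. It is not EagleFiler's own code; `same_contents` is a hypothetical helper name, and it relies on the fact that an ordinary file read on the Mac returns the data fork.

```python
import filecmp


def same_contents(path_a, path_b):
    """Return True if the two files have byte-for-byte identical contents."""
    # shallow=False makes filecmp compare the actual bytes rather than
    # trusting the size and modification time reported by os.stat().
    # File names, titles, and tags play no part in the result.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Two files with different names (or different titles in EagleFiler) still count as duplicates under this test, as long as their bytes match.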

I’m not sure what you mean here. Where are you seeing “Learn2Grow Article Template” and where are you seeing the “true” name?

Could you give some examples of the full names (the File column) in EagleFiler for the “expected” file and the “duplicates”? Have you been editing these files? If so, using which applications?

Even more duplicates
Michael - it seems that things are getting worse. I quit EagleFiler and restarted, and now I am seeing even more duplicates. Two example screenshots are attached.

In MultipleDuplicates.jpg you will see the same article appearing 3 times (two instances.) The original is tagged either with a GG or PRB. The duplicates are tagged as unread •.

In Duplicates.jpg you will see duplicate instances of “Bee” Kind to Bees. In the other case, Bee-come a Bee Hugger also appears multiple times as Learn2Grow® Article Template. In this case the article was imported once as Bee-come a Bee Hugger and never as Learn2Grow® Article Template. Writers are working from the template but submitting with a proper article name and their author name.

Hope this clarifies the problem better.

EagleFiler imports the Title from the title field in the file’s metadata. For Microsoft Word documents, you can set this in the Properties window. Of course, you can also edit the title later in EagleFiler.
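If you want to check which title a file carries in its own metadata, the sketch below reads it directly. It assumes the documents are in the newer .docx format, where the title is stored as `dc:title` in `docProps/core.xml` inside the zip package; older binary .doc files store it differently and are not covered. `docx_title` is a hypothetical helper, and this is not how EagleFiler itself reads the field.

```python
import xml.etree.ElementTree as ET
import zipfile

# Dublin Core namespace used for the title element in docProps/core.xml.
DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"


def docx_title(path):
    """Return the Title property stored in a .docx file, or None if absent."""
    with zipfile.ZipFile(path) as package:
        try:
            core_xml = package.read("docProps/core.xml")
        except KeyError:
            # The package has no core properties part at all.
            return None
    title = ET.fromstring(core_xml).find(DC_TITLE)
    return title.text if title is not None else None
```

A document created from a template and never retitled in Word would be expected to return the template's title here, regardless of its filename.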

It would be nice if you could make the File column wider so that I could see the full filenames.

When you open a library, EagleFiler does a Scan for New Files. That is, it looks for files that you’ve added to the library folder not using EagleFiler and imports them. My guess is that that’s the source of these “duplicates.” That’s why I asked if and how you had been editing these files.

More on duplicates
Michael -

All files imported in this library were imported using the capture key. None came in from the scan for new files. I rarely edit these files; they are my database for a website’s source content. I don’t use the Import Folder for this library, and I certainly haven’t tried to add 100+ duplicates to the library either. In fact it has been a month since I added anything to the library, and that was done via the capture key. This issue just popped up over the last few days.

Duplicates2.jpg is a wider screen view of a sample of the duplicates. Hope this provides what you need.

If you didn’t import them yourself, then they must have come from Scan for New Files. First I suggest looking in your library folder in the Finder. Do you see the duplicate files there? If so, there’s your answer. If not (which is what I suspect given the last screenshot) there’s probably some sort of malfunction with either Scan for New Files or the filesystem such that EagleFiler thinks the existing files are new. To see if this is the case, you could turn off Scan for New Files and see if that helps.

More on duplicates
Michael - I seem to be having problems posting a response to your last comments. Let me try again. If you received the prior attempted posts, hopefully I am somewhat consistent.

  1. There are still about 400 duplicates sitting at the root level of the File folder. This is after I deleted over 150 files individually. Auditing is tedious because the real files should be filed within the sub folder structure. Spot checks so far find all of the files filed.

  2. I do not use the Import folder for this library, all files are imported via the capture key and then the original is deleted. This means I would have had to manually go into hundreds of sub folders in the library and copy files into a location to be scanned. I clearly did not do any such thing.

  3. I did turn off scan as per your instructions.

  4. The duplicates vary between exact copies and copies where the metadata has reverted to the Word template document from which they were created but the content is exactly the same as the original files added to the library. I rarely get files with the original Word template metadata intact. It is clear however, that EF did not identify any of these as duplicates even though they are exact duplicates in many cases. See the attachment.

  5. The files in the library are rarely edited. They are used for lookups or to copy, but not change, content inside of the Word documents. All documents in the Library are tagged and filed into folders matching a large website structure.

Bottom line: What is the quickest and easiest way to delete the duplicates? What can be done to avoid this type of corruption again? I do use the Import folder on other libraries so it would be helpful to be able to turn scanning on again.

I see, so the duplicate files are not in the same folders as the originals? Did you initially import the files at the top level (the “Files” folder) and then move them into subfolders? Or did you capture them directly into the subfolders?

In the cases where there are multiple duplicates of the same file, could you give an example of which folders they are in?

I’m not saying that you did that, but it seems incontrovertible that the files ended up there and so EagleFiler detected and imported them. It didn’t conjure them out of nothing.

That’s to be expected. When EagleFiler detects that a file has been added to a folder that it manages, it does not check for duplicates. It assumes that the file was purposely put there and so deleting it would be rude.

It does check for duplicates when capturing or importing via drag and drop or the “To Import” folder.

Are there any non-duplicates at the top level of the “Files” folder? Are there any duplicates in subfolders?
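Those two questions could also be answered outside EagleFiler rather than by spot checking. The following is only a rough sketch: `audit_top_level` and `content_digest` are hypothetical names, and it assumes the library's “Files” folder can be read as ordinary files on disk. It only reports and never deletes anything.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def content_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_top_level(files_dir):
    """Sort top-level files by whether an identical copy exists in a subfolder.

    Returns (duplicated, unique): `duplicated` maps each top-level file to
    the subfolder files with the same contents; `unique` lists top-level
    files that have no such copy.
    """
    files_dir = Path(files_dir)

    # Index every file that lives below the top level by its content digest.
    by_digest = defaultdict(list)
    for p in files_dir.rglob("*"):
        if p.is_file() and p.parent != files_dir and not p.name.startswith("."):
            by_digest[content_digest(p)].append(p)

    duplicated, unique = {}, []
    for p in sorted(files_dir.iterdir()):
        if not p.is_file() or p.name.startswith("."):
            continue  # skip subfolders and hidden files such as .DS_Store
        matches = by_digest.get(content_digest(p), [])
        if matches:
            duplicated[p] = matches
        else:
            unique.append(p)
    return duplicated, unique
```

For example, `duplicated, unique = audit_top_level("/path/to/library/Files")` (a placeholder path). An empty `unique` list would mean that every file at the top level also exists, byte for byte, somewhere in a subfolder.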

In order to prevent it, we would need to know for sure what caused it. Right now I have two theories:

  1. Maybe some other software added the duplicates at the top level after EagleFiler had moved them into subfolders. Where is your library stored? Any differences compared with your other libraries that are not exhibiting this problem? Are you using any syncing or file sharing software, e.g. Dropbox?
  2. Maybe there was a permissions or other filesystem problem such that when you filed into the subfolders the files were copied instead of moved, i.e. the originals were left behind at the top level and later detected as duplicates. If so, you could test this by capturing a new file, moving it, and seeing whether there’s still a file at the top level.

Turning off “Scan for New Files” does not affect the “To Import” folder. They’re separate features.

Duplicate files continued
—Quote (Originally by tansey)—

  1. There are still about 400 duplicates sitting at the root level of the File folder. This is after I deleted over 150 files individually. Auditing is tedious because the real files should be filed within the sub folder structure.
—End Quote—
I see, so the duplicate files are not in the same folders as the originals? Did you initially import the files at the top level (the “Files” folder) and then move them into subfolders? Or did you capture them directly into the subfolders?
—End Quote (Originally by tsai)—
Documents were captured into the top level and then tagged and moved to folders. The duplicates appear to be at the root level if they are properly named. I cannot find the location of the generic Learn2Grow Article Template files. Searching in the home folder only produces one hit, when there are 100+ duplicates.

Actually didn’t know you could capture to sub folders.

In the cases where there are multiple duplicates of the same file, could you give an example of which folders they are in?
—End Quote—
Named duplicates are at the File root level. I don’t see any at the sub folder level. Generic Learn2Grow Article Template files are a mystery to me.

—Quote (Originally by tansey)—
This means I would have had to manually go into hundreds of sub folders in the library and copy files into a location to be scanned. I clearly did not do any such thing.
—End Quote—
I’m not saying that you did that, but it seems incontrovertible that the files ended up there and so EagleFiler detected and imported them. It didn’t conjure them out of nothing.
—End Quote (Originally by tsai)—
Actually it looks like something conjured them out of nothing, but I am not saying that EF did it. I love EF.

—Quote (Originally by tansey)—
It is clear however, that EF did not identify any of these as duplicates even though they are exact duplicates in many cases.
—End Quote (Originally by tsai)—
That’s to be expected. When EagleFiler detects that a file has been added to a folder that it manages, it does not check for duplicates. It assumes that the file was purposely put there and so deleting it would be rude.

It does check for duplicates when capturing or importing via drag and drop or the “To Import” folder.
—End Quote (Originally by tsai)—
So the interesting issue is what it was scanning. I have no ideas at this time. Are there log files I can check/provide? All files imported to this library by me were done via the capture key.

—Quote (Originally by tansey)—
Bottom line: What is the quickest and easiest way to delete the duplicates.
—End Quote—
Are there any non-duplicates at the top level of the “Files” folder? Are there any duplicates in subfolders?
—End Quote (Originally by tsai)—
Subfolders don’t appear to have duplicates, and the top level is not duplicated, but I still don’t see where the generic Learn2Grow Article Template files are located.

Sampling is time consuming, but so far the root level appears to hold duplicates of subfolder articles. The problem is looking at the root and then confirming among hundreds of subfolders. The issue of the generic files is hard because the Finder (in my case Path Finder) does not show the generic Learn2Grow Article Template files.

—Quote (Originally by tansey)—
What can be done to avoid this type of corruption again.
—End Quote—
In order to prevent it, we would need to know for sure what caused it. Right now I have two theories:

  1. Maybe some other software added the duplicates at the top level after EagleFiler had moved them into subfolders. Where is your library stored? Any differences compared with your other libraries that are not exhibiting this problem? Are you using any syncing or file sharing software, e.g. Dropbox?
  2. Maybe there was a permissions or other filesystem problem such that when you filed into the subfolders the files were copied instead of moved, i.e. the originals were left behind at the top level and later detected as duplicates. If so, you could test this by capturing a new file, moving it, and seeing whether there’s still a file at the top level.
—End Quote (Originally by tsai)—
File is in Documents/Eaglefiler/L2G Articles. Most of my other EF libraries reside there but some are in other locations.

Don’t know why some other program might have reached into EF files and duplicated them and then put them in a place to be scanned for addition. Also unclear how the metadata was pulled out to rename the files and duplicate hundreds with the same file name.

Other libraries seem to be ok. Yes I am using Dropbox and MobileMe but I don’t see any relationship to duplicates since the duplicate files are long gone from Dropbox where they originated, and even those that arrived through Dropbox had correct metadata.

—Quote (Originally by tansey)—
I do use the Import folder on other libraries so it would be helpful to be able to turn scanning on again.
—End Quote—
Turning off “Scan for New Files” does not affect the “To Import” folder. They’re separate features.


—End Quote (Originally by tsai)—
Didn’t understand that; I need to read the documentation to understand this feature. How do I turn it back on? The provided page turns it off but I don’t know if I have to use Terminal or some other script to turn it back on.

Michael I know you are great at ferreting out and supporting your users so I am more than willing to try anything you suggest. Besides I couldn’t live without SpamSieve!

Note! I am having real problems with posting. I log in and then post a reply. When I go to submit, it acts as if I am not logged in. The reply is lost and I have to start over. Fortunately, since this has happened multiple times, I copy my response before hitting Submit Reply. It showed me as logged in before I hit Reply and then asked me to log in. Is this a timeout issue?

Thanks Frank

If you see it in EagleFiler, can you not use the Reveal in Finder command to see which folder it’s in?

You can do that with the Capture with options key.

Could you post a screenshot that shows the full filenames?

I’m not sure what you’re asking here. It scans the “Files” folder, which is where the files are appearing.

EagleFiler has a log file, but it only logs errors. In this case, from EagleFiler’s point of view I think this was all normal.

Can’t you look for them in EagleFiler rather than in the Finder or Path Finder? From what you say above, it sounds like you could select and delete the files in EagleFiler’s Unfiled smart folder.

From your screenshots, it does not look like the files were renamed. It looks like for the duplicates the title was pulled from the file’s metadata; whereas for the originals the title is something newer that you’d set in EagleFiler’s database.

Scanning operates on the contents of the “Files” folder. EagleFiler scans when opening the library and when you choose the command from the menu.

The “To Import” folder is next to the “Files” folder and EagleFiler scans it continually.

The esoteric preferences page has links for turning scanning on and off.

Since you say that the other libraries are working normally, perhaps an easy solution would be to remove the duplicates (as described above), then make a new library, and import the contents of the problem library. Hopefully the new library will not experience the same problem.

I’m not sure. This is the first time that someone has reported problems posting. It doesn’t sound like a timeout issue to me. Perhaps a cookie issue. Which version of Mac OS X and which browser are you using?