Finding duplicate video clips
I thought I'd pick your brains on something that's been bugging me for a while now.
I'm in the process of cleaning some older archived footage. When I import all of this footage into a CatDV catalog, I nearly always get duplicate clips, simply because of the disorganized nature of the folders I'm working with.
Currently, I have several different ways of finding duplicates and partial duplicates:
1) Find Similar --> In & Out
- This only finds exact duplicates, which is helpful if the person who was originally working with the footage accidentally copied the exact file into a different folder.
2) Find Similar --> Duration
- This finds clips which may not share the same In & Out but could still be duplicates. Maybe the In & Out was changed at one point.
- This method is limited if I'm working with over 3000 clips, as the chances of unrelated clips coincidentally matching on duration are high.
3) Find Similar --> In or Find Similar --> Out
- This helps me find partial duplicates, or remnants of clips that were cut.
The issue I am having is with finding these partial duplicates. I don't want 3 files in the catalog where one is the full-length 60s clip, for example, and the other two are 10s and 20s partials of the original. I only want to keep the full-length clip.
My thought is that it'd be nice if I could find similar clips where the partials are within the In & Out range of the full length clip. This would find all of the cut duplicates which were used for editing, I think.
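If you can get the clip list out of CatDV (e.g. by exporting the catalog and pulling the In/Out columns), the containment check itself is simple. A rough sketch in Python — the field names are made up, and it assumes In/Out have already been converted to seconds:

```python
# Flag clips whose In/Out range sits entirely inside another clip's range.
# Clip records are illustrative; in practice you'd load them from an
# exported catalog (CSV, XML, etc.).

clips = [
    {"name": "master.mov",   "in": 3290, "out": 3350},  # 60s full clip
    {"name": "partial1.mov", "in": 3320, "out": 3340},  # 20s cut from it
    {"name": "other.mov",    "in": 100,  "out": 160},   # unrelated clip
]

def contained(inner, outer):
    """True if inner's In/Out range lies within outer's (and isn't the same clip)."""
    return (inner is not outer
            and outer["in"] <= inner["in"]
            and inner["out"] <= outer["out"])

# Any clip contained in some other clip is a candidate partial duplicate.
partials = [c["name"] for c in clips
            if any(contained(c, other) for other in clips)]
print(partials)  # → ['partial1.mov']
```

The full-length clip is never flagged, since nothing contains it, so deleting everything in `partials` would leave only the masters.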
If anyone has other ideas, I'm up for trying them.
CatDV 9.0 here by the way, no worker node.
Just my 2 cents, but make sure it's worth the time.
In the end, keeping three versions of something (the 20, 30 and 60 for instance) may not be a huge problem and you can clean it up over time, unless you have to deliver this all to a large archive or library where they won't accept dupes.
I'd take it all in, clean it up as best you can, tag it and as people need things and search for them, they can flag the dupes and clean them up later.
Otherwise you may find yourself "cleaning" for a long time when you could be editing.
One of the best things about CatDV, and any MAM really, is that you can flag an asset to be looked at later. Even if in the heat of a search you can't stop to fix it, you can leave behind a "#review" tag or something else you can search for. Then, when you have some time, you can search on that metadata and clean up the messes people found.
It sounds like you have it pretty well sorted. My only thought is that the difference between 90% and 100% may be a lot of work.
Of course this has little to nothing to do with your original question. I'm gonna wrap it up.
bryson "at" northshoreautomation.com
There are better fields in the Find Similar command, like File hash and Clip signature among others, that are useful in identifying duplicates. Also, when you say a partial clip, do you mean a sub-clip or another file? I realize you are on version 9, but in version 10 there is a disk space tool that is really good at finding duplicates, and very fast. Also, in the preferences under Import you can avoid duplicates on import. Hope this helps!
Here is a link that explains the clip identifiers, and a screen grab of the disk space tool.
Thanks for the reply,
When I say duplicates, I mean different files with different names. The file hash and clip signature will be different because the file was saved/exported after being cut in Final Cut. It seems the only data that remains similar is the In & Out of the clip.
So for example, the In & Out of the 60 sec clip will be 54:50 - 55:50 and the 20 sec clip would be within the In & Out range of the full size clip, for example 55:20 - 55:40. In this instance, the file size, duration, In & Out, file hash and clip signature would all be different, making it difficult to find these clips as duplicates.
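To illustrate with those exact numbers: once the MM:SS values are converted to seconds, the containment is easy to test (this is just a toy helper, not anything from CatDV):

```python
# Hypothetical helper: convert the MM:SS timecodes from the example above
# to seconds so In/Out ranges can be compared numerically.

def tc_to_seconds(tc: str) -> int:
    minutes, seconds = tc.split(":")
    return int(minutes) * 60 + int(seconds)

master  = (tc_to_seconds("54:50"), tc_to_seconds("55:50"))  # 60 sec clip
partial = (tc_to_seconds("55:20"), tc_to_seconds("55:40"))  # 20 sec clip

# The partial falls entirely inside the master's In/Out range:
is_partial = master[0] <= partial[0] and partial[1] <= master[1]
print(is_partial)  # → True
```

Real timecodes are usually HH:MM:SS:FF, so a production version would also need to handle hours and frames.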
Just to test, I've been going through a 2500 clip catalog to see how many of these partial duplicates I had. Turned out there were over 150, taking up about 10GB in total. Considering the size of our library, if these aren't cleaned up, they will end up wasting a lot of storage space.
Got it, I think I understand better. The File hash will find files with different names, and though it is not foolproof, it would likely still be useful. Based on my estimate you have about 5-6% duplicates, which in all honesty is not that bad from what I've seen. One thing to keep in mind: if these files will be required in the future to open an FCP session or re-edit, the time and/or space saved in this process may be lost if and when those projects are brought back to life. Also, with the price of storage, 150GB is probably about $10 worth of storage space, plus the time to clean it up, so this may cost you more in time than it does in disk space. In many of these situations there is the challenge of cleaning up the existing files and, ideally, putting some procedures in place to make this easier in the future. One suggestion for making this easier moving forward is to append or prepend files exported from FCP with something indicating whether it is a duplicate (for example, M for Master and P for Partial).
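With a naming convention like that in place, a downstream script can separate masters from partials without touching any clip metadata. A tiny sketch (the prefixes and filenames are just illustrative):

```python
# Split exported files into masters and partials by a naming convention:
# "M_" for masters, "P_" for partial/cut exports.

filenames = [
    "M_interview.mov",
    "P_interview_cut1.mov",
    "P_interview_cut2.mov",
]

masters  = [f for f in filenames if f.startswith("M_")]
partials = [f for f in filenames if f.startswith("P_")]

print(masters)   # → ['M_interview.mov']
print(partials)  # → ['P_interview_cut1.mov', 'P_interview_cut2.mov']
```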
Hope this helps!