Sunday, April 26, 2009

De-Dupe Files Script Tool for Windows 32bit V2.0beta

I know, it has been less than a month since I introduced Version 1.1 of the De-Dupe script. See my original blog post where I introduced the script here. I have used the script quite a lot myself and, as a consequence, implemented improvements and additional features.
 
I won’t post the whole source code in this post this time, though, because it got significantly bigger and would cause problems with loading the post in your web browser. The code is still included in the install package of the script/tool, and 42KB zipped is not very much to download, is it?
 
Download De-Dupe V2.0 Beta ( Roy-DeDupeScript20b.zip)

The background story and general file structure remained the same, but I made a lot of other changes, especially “under the hood”. The most important improvement is the PERFORMANCE. Where V1.1 took several minutes to process a folder, V2.0 takes only seconds. De-duping folders with thousands, and not just hundreds, of files is NOT A PROBLEM for this script anymore. Here are the details:

Name of the Software: DeDupe Files Script Tool
Author: Carsten Cumbrowski
Version 2.0 beta
License: Freeware
Date: April 2009

Visit http://www.cumbrowski.com/ for resources to web and database development and internet marketing. There you can also find the contact page with various means to get in touch with the author of this tool.

The script detects duplicate files within a directory.

Duplicate files are files that have the same MD5 Check Sum value.
Two DIFFERENT/NON IDENTICAL files having the same MD5 Check Sum is not impossible, but highly unlikely.
This allows the script to detect duplicate files regardless of their file name or other characteristics, such as "date created" or "date modified".
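
To illustrate the idea, here is a minimal Python sketch of checksum-based duplicate detection. It is not the code of the DeDupe script (which is VBScript and calls md5sum.exe); the function names and the case-insensitive sort by name are my own assumptions for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Group the files directly inside `folder` by MD5 checksum.

    Returns a list of (original, [dupes]) tuples. Sub directories are
    skipped. The first file by name in each group counts as the original
    (ASSUMPTION: names are compared case-insensitively).
    """
    files = sorted(
        (p for p in Path(folder).iterdir() if p.is_file()),
        key=lambda p: p.name.lower(),
    )
    groups = defaultdict(list)
    for path in files:
        groups[md5_of_file(path)].append(path)
    return [(paths[0], paths[1:]) for paths in groups.values() if len(paths) > 1]
```

File names and dates play no role here. Only the file content, via its checksum, decides whether two files land in the same group.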

The tool scans all files within a directory. It does not include files in sub directories of the processed folder.

Multi-Threading

The script now supports multi-threading for the MD5 checksum determination, which was the bottleneck of the previous version. By default the script uses 50 threads. The number can be changed in the settings or on the fly via the command line switch /threads:NN, where NN should be a number greater than or equal to 1. I don't know the maximum value, because it depends on the machine the script runs on.

Be careful and only increase the value in small steps if you want to improve performance even more. The thread count actually starts at 0, which means that if you set /threads:49 (the default), you actually get 50 threads.
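
The following Python sketch shows the general technique of computing checksums in parallel, including the "NN + 1" counting described above. It is only an illustration under my own assumptions: it uses Python's thread pool and hashlib instead of the script's md5sum.exe calls, and the names worker_count and checksums_in_parallel are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_of_file(path):
    """Return the hex MD5 digest of one file."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


def worker_count(threads_setting):
    """Translate a /threads:NN value into the real number of workers.

    The count starts at 0, so a setting of 49 means 50 workers.
    """
    return int(threads_setting) + 1


def checksums_in_parallel(paths, threads_setting=49):
    """Hash all files with a pool of workers and return {path: md5}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=worker_count(threads_setting)) as pool:
        # pool.map keeps the results in the same order as the input paths.
        return dict(zip(paths, pool.map(md5_of_file, paths)))
```

More workers only help as long as the disk and CPU can keep up, which is why increasing the value in small steps is the safer approach.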

Dupe Actions

There are 5 different Actions you can choose from to tell the script what to do if it finds a duplicate file:

  1. Rename the dupe to aFile1_EXT_bFile2[DEDUPED].EXT, where aFile1 is the base name of the original file, followed by "_" and its EXT(ension), and bFile2 is the base name of the dupe, followed by [DEDUPED] and the dupe's original .EXT(ension)
  2. Rename dupes as in 1, but MOVE them to a new sub folder "[Deduped]" of the path being processed
  3. (DEFAULT ACTION) Don't rename dupes, just MOVE them to a new sub folder "[Deduped]"
  4. Delete dupes (gone for good, unless you enabled the "Recycle Bin" to be able to recover deleted files)
  5. Create a sub folder at the specified location (/cdb:BACKUPPATH) with the name "yyyy-mm-dd_hh-mm-ss_FolderName" and create an index file !Index.txt with the archive location and name and the original locations of the files, separated by "|"

The default DeDupe Action can be overridden via the command line option
/action:[1...5], e.g. /action:3 for the default Action

Examples:


Dupe Action (1)
If a duplicate file is found, it will be renamed by adding the original file name as a prefix, with an '_' as separator. An '_' also replaces the "." that marks the file extension of the original file name (other "." in the file name itself remain). The string [DEDUPED] is added at the end of the file name.

For example, aFile1.EXT and bFile2.EXT are identical. After the script has run, one of the two files remains as it is and the other one is renamed. Which file is considered the "original" is determined by which file was found first. The script sorts the files by name before it dedupes them.

In this example aFile1.EXT would be considered the original and bFile2.EXT would be renamed to aFile1_EXT_bFile2[DEDUPED].EXT. This makes dupes appear right after the original if you sort the directory by file name. To filter the dupes in order to copy/move them away or to delete them, use the copy, move or del command in MS DOS. For example, "DEL *[DEDUPED].*" would delete all duplicate files found and renamed by the script.
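
The naming rule can be written down in a few lines. This is a Python sketch of the rule as described above, not the script's actual VBScript code; the function name deduped_name is hypothetical, and how a file without any extension is handled is my own assumption.

```python
import os


def deduped_name(original_name, dupe_name):
    """Build the new file name for a dupe under Dupe Action 1.

    Only the "." in front of the original's extension becomes "_";
    other dots in the original file name stay untouched.
    """
    orig_base, orig_ext = os.path.splitext(original_name)
    dupe_base, dupe_ext = os.path.splitext(dupe_name)
    # ASSUMPTION: an original without extension contributes only its base name.
    prefix = orig_base + ("_" + orig_ext[1:] if orig_ext else "")
    return prefix + "_" + dupe_base + "[DEDUPED]" + dupe_ext


print(deduped_name("aFile1.EXT", "bFile2.EXT"))
# prints aFile1_EXT_bFile2[DEDUPED].EXT
```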

Dupe Action (2)
If you want the dupes renamed as in Action 1, but would like to have them moved away from the source directory, choose Dupe Action 2. Dupes are still being renamed as in (1), but the script moves the dupes to a sub directory called "[DeDuped]" within the processed folder.

Dupe Action (3)
If you just want the dupes moved away from the source folder, but want to keep the original file names, use this Dupe Action (which is the default action, btw). You will find all the duplicate files in the subfolder "[DeDuped]" in the processed folder.

Dupe Action (4)
If you simply want to get rid of the dupes and delete them, use this Action.

Dupe Action (5)
A variation of Action 3. The difference is that the dupes are not moved to a sub folder below the processed folder, but to a central dupe archive folder, where a new sub directory is created, named after the date and time of the DeDupe processing and the name of the folder that was DeDuped.

The default location for that centralized backup folder is "C:\[DEDUPE_BACKUP]", but that can be changed, either via the registry settings or on the fly via the command line option:

/cdb:BACKUPPATH
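
As an illustration of the archive folder naming for Action 5, here is a small Python sketch. The function names are hypothetical, and the order of the two fields in an index line is my assumption; the description above only says that the entries are separated by "|".

```python
from datetime import datetime
from pathlib import PureWindowsPath


def archive_folder(backup_root, processed_folder, when):
    """Return <backup_root>\\yyyy-mm-dd_hh-mm-ss_FolderName."""
    stamp = when.strftime("%Y-%m-%d_%H-%M-%S")
    folder_name = PureWindowsPath(processed_folder).name
    return PureWindowsPath(backup_root) / (stamp + "_" + folder_name)


def index_line(archived_file, original_file):
    """One !Index.txt entry (ASSUMPTION: archive location first, then original)."""
    return str(archived_file) + "|" + str(original_file)


target = archive_folder(r"C:\[DEDUPE_BACKUP]", r"D:\Photos\Holiday",
                        datetime(2009, 4, 26, 14, 30, 5))
print(target)
# prints C:\[DEDUPE_BACKUP]\2009-04-26_14-30-05_Holiday
```

Because the time stamp comes first, the archive folders sort chronologically in the central backup folder.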

Log Files

The script creates up to two files in the processed directory, one by default and the other one optionally:

"!DeDupe-FileList.txt"
- (optional feature) a list of all files in the directory and their MD5 Check Sum Values (tab separated)

"!DeDupeLog.txt" - (enabled by default) a processing logfile where you can find the list of dupes that were detected, their old & new file name and the corresponding original file

If you do not want any of the files to be created, change the options for "WriteFileList" and "WriteDeDupeLog" to "0" at the beginning of the code of "DedupeFilesInFolder2.vbs". Alternatively, use the command line options /log:[0/1] and /list:[0/1] to turn the creation of the list and/or log on or off. You can also specify a different file name for the file list and the DeDupe log, but you cannot change the path.

/list:0 or /list:1
/listfile:FileListName.Ext

/log:0 or /log:1
/logfile:LogFileName.Ext

You can also suppress all dialogs via the command line option /quite:[0/1]. /quite:1 would disable the progress dialog, results message and all error messages.

/quite:0 or /quite:1

Note, the script returns error levels for batch processing regardless of the "quiet" settings.
The ErrorLevel codes are:

0 = Script Ran Successfully
1 = Script Ran, but there were no files to process
2 = The script was aborted (only relevant if progress dialog is on)
4 = Script Error (md5sum.exe or processing path not found)

Important: Version 2 of the script enforces execution with CSCRIPT.EXE and re-launches itself if it is started with WSCRIPT.EXE. This happens on purpose when the script is started via the Shell Extension, which still calls WSCRIPT.EXE. If you are using the Quiet option and want to get the correct ErrorLevel back, you must execute the script with CSCRIPT.EXE from your application!
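
Here is a hedged Python sketch of how an application could call the script with CSCRIPT.EXE and interpret the ErrorLevel. The codes are the ones listed above and //nologo is a standard CSCRIPT.EXE option. How the folder is passed to the script is NOT documented in this post, so handing it over as the first argument is purely my assumption, and the helper names are hypothetical.

```python
import subprocess

# ErrorLevel codes as documented above.
EXIT_MEANINGS = {
    0: "script ran successfully",
    1: "script ran, but there were no files to process",
    2: "script was aborted",
    4: "script error (md5sum.exe or processing path not found)",
}


def describe_exit_code(code):
    """Translate an ErrorLevel into a readable message."""
    return EXIT_MEANINGS.get(code, "unknown exit code %d" % code)


def build_command(folder, script="DedupeFilesInFolder2.vbs"):
    """Build the CSCRIPT.EXE call with all dialogs suppressed.

    ASSUMPTION: the folder to process is the first script argument.
    """
    return ["cscript.exe", "//nologo", script, folder, "/quite:1"]


def run_dedupe(folder):
    """Run the script (Windows only) and return (code, message)."""
    completed = subprocess.run(build_command(folder))
    return completed.returncode, describe_exit_code(completed.returncode)
```

In a batch file the same check would be done with "IF ERRORLEVEL" after the CSCRIPT.EXE call.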

Installation/De-Installation


Use the provided Batch Scripts "DeDupeInstall.bat" and "DeDupeUnInstall.bat" to install or un-install the De-Dupe Shell Extension.

Installation

Double click on the Batch Script File "DeDupeInstall.bat"
That's it.

Notes:
The install batch file copies md5sum.exe and DedupeFilesInFolder2.vbs into the System32 directory under your Windows installation directory and imports the registry file "DedupeInstall.reg" into your system's registry database. It creates entries under the Registry Key: HKEY_LOCAL_MACHINE\SOFTWARE\Classes\Directory\shell\

None of the files in the installation directory are needed anymore to run the script itself. You will need them only to uninstall the tool or to re-install it, if necessary.

Uninstallation

Double click on the Batch Script File "DeDupeUnInstall.bat"
That's it.

Notes:
The Un-Install batch file deletes the two files from your System32 directory and uses the registry file DedupeUnInstall.reg to remove the entries for the script from your system's registry database. If you want to continue to use the tool md5sum.exe and only want to disable the shell extension, you have two options. Either simply double click on the file DedupeUnInstall.reg without executing the uninstall batch file (the script DedupeFilesInFolder2.vbs will remain in your System32 folder as well, though), or copy the tool back into your system folder manually after you ran the uninstall batch file.

Script Settings in the System Registry


The Install Script automatically creates the default settings for the script execution in the Windows Registry.
If you did not use the install script and run the DeDupe script for the first time, the script will create the missing registry entries based on the default values specified in the script code itself. Those settings are especially important for the use of the DeDupe shell extension in Windows Explorer.

The settings in the registry are set for each Windows User separately.
To find and modify the settings, open the Registry Editor that comes with Windows. (Start / Run, enter: Regedit, press "enter")

Navigate to HKEY_CURRENT_USER \ Software \ DeDupe2 \ Parameters

Value Name               Type        Default Value          Command Line Param Equivalent
------------------------------------------------------------------------------------------
CentralDupeBackupFolder  REG_SZ      C:\[DEDUPE_BACKUP]            /cdb:PATH
DeDupeLogFName           REG_SZ      !DeDupeLog.txt                /logfile:FILENAME
DupeAction               REG_DWORD   3                             /action:N
FileListFName            REG_SZ      !DeDupe-FileList.txt          /listfile:FILENAME
MaxForks                 REG_DWORD   31 (HEX) or 49 (Decimal)      /threads:NN
Quiet                    REG_DWORD   0                             /quite:N
WriteDeDupeLog           REG_DWORD   1                             /log:N
WriteFileList            REG_DWORD   0                             /list:N
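
The relationship between code defaults, registry values and command line switches can be sketched in Python as follows. The value names, defaults and switches are taken from the table above. The function effective_settings and the way it parses the "/name:value" switches are my own illustration, not the script's code, and the stored registry values are passed in as a plain dictionary instead of being read from the registry.

```python
# Defaults and switch names as listed in the table above.
DEFAULTS = {
    "CentralDupeBackupFolder": r"C:\[DEDUPE_BACKUP]",
    "DeDupeLogFName": "!DeDupeLog.txt",
    "DupeAction": 3,
    "FileListFName": "!DeDupe-FileList.txt",
    "MaxForks": 49,
    "Quiet": 0,
    "WriteDeDupeLog": 1,
    "WriteFileList": 0,
}

SWITCHES = {
    "cdb": "CentralDupeBackupFolder",
    "logfile": "DeDupeLogFName",
    "action": "DupeAction",
    "listfile": "FileListFName",
    "threads": "MaxForks",
    "quite": "Quiet",
    "log": "WriteDeDupeLog",
    "list": "WriteFileList",
}


def effective_settings(args, stored=None):
    """Merge defaults, stored (registry) values and command line switches.

    Later sources win: defaults < stored values < switches.
    """
    settings = dict(DEFAULTS)
    settings.update(stored or {})
    for arg in args:
        if not arg.startswith("/") or ":" not in arg:
            continue  # not a /name:value switch, e.g. the folder path
        name, value = arg[1:].split(":", 1)
        key = SWITCHES.get(name.lower())
        if key is None:
            continue  # unknown switch, ignored in this sketch
        # Numeric settings (the REG_DWORD values) are converted to int.
        settings[key] = int(value) if isinstance(DEFAULTS[key], int) else value
    return settings
```

For example, effective_settings(["/action:4", "/threads:9"]) keeps every default except DupeAction (4) and MaxForks (9).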

Upgrade from Previous DeDupe Versions


You might have noticed that the main script has the name "DedupeFilesInFolder2.vbs". The previous version of the script has the name "DedupeFilesInFolder.vbs".

The install script also creates a different shell extension with the name "DeDupe2". If you installed version 1.x of the script and then use the install script for version 2, both scripts will be installed on your machine. You could continue to use them in parallel if you want to, but I would not recommend it to the average user. I suggest running the uninstall script of the previous version first and then the install script of the new version.

Note: If you run the uninstall script of the previous version AFTER you installed version 2 of the script, version 2 will no longer function properly, because the uninstall script of the previous version also removes the 3rd party tool "md5sum.exe" from the System Directory. You either have to copy that tool back to the Windows system directory manually or run the installation script for Version 2 once more. Running the installation again will overwrite any settings in the registry that you might have changed already.

About the Software

The DeDupe Windows Explorer Shell Extension Script Tool is written in VBScript and is executed by the Windows Script Host tools WScript.exe and CScript.exe. The DeDupe script (DedupeFilesInFolder2.vbs) requires one small support tool to work properly.

"md5sum.exe" is a small command line tool that returns the MD5 Check Sum value for a file.
It can also validate MD5 check sums, which is a feature that is not used by the DeDupe script.
You can find out more information about it at http://etree.org/md5com.html
Md5Sum was written by bruce@gridpoint.com

Legal Stuff/Copyright and Disclaimer

The 3rd party tool that comes with the DeDupe script is freeware and can be used and copied by anybody without the need for a license or to pay a fee. Since I did not write that tool, I cannot take any responsibility for any issues that it might cause, via my script or without it.

This DeDupe script is also freeware and can be used, copied and modified for free.

Important Disclaimer!

The author of this software accepts no responsibility for damages resulting from the use of this product and makes no warranty or representation, either express or implied, including but not limited to any implied warranty of merchantability or fitness for a particular purpose.

This software is provided "AS IS", and you, its user, assume all risks when using it.

Change Log

V1.1

  • MD5Sum Determination Issue Resolved for file names with spaces in them
  • Sorting by File Name Issue Resolved, now the "original" is really the first one sorted by name
  • Progress dialog implemented to show status
  • Quiet option implemented to suppress all dialogs
  • Return of ErrorLevels implemented for batch scripts that call the script
  • Rename logic changed, [DEDUPED] added to the renamed file in addition to existing logic
  • touch.exe tool removed. It did not work reliably, period
  • File List output with file names and their MD5 checksums implemented
  • Log File output implemented
  • command line parameters introduced to suppress file list and log file creation as well as to enable/disable "quiet" mode
  • general code clean up

V2.0

  • MD5Sum Determination now separate Step using Multi-Threading for increased Performance
  • New DupeActions 2, 3, 4 and 5 implemented
    2. = Rename dupes as in 1, but MOVE to a new sub folder "[Deduped]" of the path being processed
    3. = Don't rename dupes, just MOVE to a new sub folder "[Deduped]"
    4. = Delete dupes
    5. = Create sub folder at specified location with name "yyyy-mm-dd_hh-mm-ss_FolderName", create index file "!Index.txt" with archive location and name and original locations of files, separated by "|"
  • Script now Enforces CSCRIPT.EXE (call from Shell Extension still uses WScript, because if I run it with Cscript from there a stupid DOS Shell Window is visible and open all the time)
  • Message Output changed to use IE because of CSCRIPT execution in batch mode (which suppresses Wscript.echo)
  • Settings now Saved in Registry. Manual override via command line parameters is still possible. Use of Defaults in Code vs. Registry is also an option.

I hope that you will find this script useful. Please let me know your opinion, suggestions, feedback and recommendations for improvements via the comments section down below.

Cheers!

Carsten aka Roy/SAC

3 comments:

Adam Brock said...

Thanks for sharing this, I stumbled on this last night and it's been a big help.

Would it be easy to modify the script to search for duplicates amongst a set of sub folders as well?

Carsten a.k.a. Roy/SAC said...

Hi Adam,

It would require some major changes to the script, but I am kind of working on a new version, where I also try to speed up the processing even more. There I also want to incorporate the de-duping of sub directories. Hang in there with me :)

Unknown said...

I think I can use this too. Great work! Any idea on when you think a searching among subfolders feature might be completed?
