Removing duplicates from a List

SubSpace

Bass
Just a random .NET code snippet that is probably of marginal value to you, but I happened to need it for my project.

The problem:
I have a list of arbitrary items (integers, objects, whatever) that might contain duplicates that I have to remove before doing further processing.
The catch is that I cannot change the order of the items in the list, and the whole thing should be high performance (as far as .NET goes).
If it weren't for the order requirement, sorting the List and walking through it, comparing element n with element n-1, would do the trick. You would need to copy the survivors into a new List, though, to avoid lots of element shuffling from repeated RemoveAt calls.
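For comparison, the order-destroying version would look roughly like this (the method name and the IComparable constraint are my own choices, just to make the sketch compile):

```csharp
using System;
using System.Collections.Generic;

public static class SortedDedupe
{
	// Rough sketch of the sort-based approach: sort, then copy each
	// element that differs from its predecessor into a new list.
	// O(n log n), and it scrambles the original item order.
	public static List<T> DedupeSorted<T>(List<T> list) where T : IComparable<T>
	{
		list.Sort();

		List<T> result = new List<T>(list.Count);
		for (int i = 0; i < list.Count; i++)
		{
			// Keep element i only if it's the first, or differs from element i-1
			if (i == 0 || list[i].CompareTo(list[i - 1]) != 0)
				result.Add(list[i]);
		}
		return result;
	}
}
```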

I did some searching on the internet (modern-day programming) and found some references to Pollard's rho algorithm, but it didn't seem usable in this case.

This is what I came up with:
Code:
using System.Collections.Generic;

public static void Dedupe<T>(List<T> list)
{
	// Seed the set with every distinct item; each one gets removed
	// the first time we encounter it in the pass below.
	HashSet<T> items = new HashSet<T>(list);

	int writePos = 0;
	for (int readPos = 0; readPos < list.Count; readPos++)
	{
		T item = list[readPos];
		if (items.Contains(item))
		{
			// Item is still in the hash set, so we're seeing it for the first time
			list[writePos] = item;
			items.Remove(item);
			writePos++;
		}
	}

	// Chop off the now-unused tail in a single operation
	list.RemoveRange(writePos, list.Count - writePos);
}

The HashSet Contains() and Remove() operations are O(1) on average. Building the initial HashSet should be O(n), as is the single pass over the List. I left out List.TrimExcess() since I'm deallocating the List shortly afterwards anyway.
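For what it's worth, here's a self-contained demo of the method above on some made-up sample data, showing that first occurrences survive in their original positions:

```csharp
using System;
using System.Collections.Generic;

public static class DedupeDemo
{
	public static void Dedupe<T>(List<T> list)
	{
		// Seed the set with every distinct item; each one gets removed
		// the first time we encounter it in the pass below.
		HashSet<T> items = new HashSet<T>(list);

		int writePos = 0;
		for (int readPos = 0; readPos < list.Count; readPos++)
		{
			T item = list[readPos];
			if (items.Contains(item))
			{
				list[writePos] = item;
				items.Remove(item);
				writePos++;
			}
		}
		list.RemoveRange(writePos, list.Count - writePos);
	}

	public static void Main()
	{
		// Sample data with duplicates; the original order must survive
		List<int> numbers = new List<int> { 3, 1, 3, 2, 1, 4 };

		Dedupe(numbers);

		// First occurrences kept, in order of first appearance
		Console.WriteLine(string.Join(", ", numbers)); // prints "3, 1, 2, 4"
	}
}
```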

Good thing I found out about HashSet, I had never actually used it before... Now let's see if it performs as well as I hope :)
 
Cool,

could you use LINQ? Assuming .NET 3.5:

var dedupeList = fulllist.Distinct().ToList();

Filtering might be quicker as well.

Cheers,

paul.
 
Hmm, that would have worked. For some reason I wasn't sure it would maintain item order, but since it's an extension method on IEnumerable<T>, that makes sense. Looking at the Distinct implementation, it appears to use an internal Set class very similar to HashSet. Apparently I'm not the only one reinventing the wheel...

The implementation adds to the Set as it goes instead of removing like my code does, which I suppose would be slightly faster. On the other hand, my implementation does the job in-place, which means less garbage collection, for one thing.
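The add-as-you-go variant looks roughly like this (my own sketch of the idea, not the actual Distinct source; it leans on the fact that HashSet<T>.Add returns false when the item is already present):

```csharp
using System.Collections.Generic;

public static class InOrderDistinct
{
	// One pass, no upfront set build: Add() returns true only the
	// first time an item is seen, so duplicates are simply skipped.
	public static IEnumerable<T> DistinctInOrder<T>(IEnumerable<T> source)
	{
		HashSet<T> seen = new HashSet<T>();
		foreach (T item in source)
		{
			if (seen.Add(item))
				yield return item;
		}
	}
}
```

Note that, unlike the in-place version, this one allocates a fresh sequence, trading garbage for laziness.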
I wouldn't have cared enough about it had I thought of Distinct(), but now that I have it, I might as well keep it around :)
 