
Read Large Csv File With Many Duplicate Values, Drop Duplicates While Reading

I have the following pandas code snippet that reads all the values found in a specific column of my .csv file:

sample_names_duplicates = pd.read_csv(infile, sep='\t',

Solution 1:

Not "on the fly", although drop_duplicates should be fast enough for most needs.

If you want to do this on the fly, you'll have to manually track duplicates on the particular column:

import csv

seen = []          # or set(); see the note after this snippet
dup_scan_col = 3   # index of the column to check for duplicates
uniques = []

with open('yourfile.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if row[dup_scan_col] not in seen:
            uniques.append(row)
            seen.append(row[dup_scan_col])
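One design note on the snippet above: the not in test against a list re-scans everything collected so far, so for a genuinely large file the set() alternative mentioned in the comment (with seen.add(...) instead of seen.append(...)) is the better choice, since its membership checks are O(1). If the surviving rows then need to go back into pandas, they can be wrapped directly:

import pandas as pd

# Wrap the de-duplicated rows collected above back into a DataFrame.
# If the file has a header row, it will appear as the first element of uniques.
df = pd.DataFrame(uniques)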

Solution 2:

Because the single column returned by read_csv() (with squeeze=True) is an iterable Series, you can just wrap the call in set() to remove duplicates. Note that a set loses any ordering the file had; if you then want a sorted result, convert back with list() and call sort().

Unique unordered set example:

sample_names_duplicates = set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True))

Ordered list example:

sample_names = list(set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True)))
sample_names.sort()
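One caveat worth adding: the squeeze= keyword has been removed from read_csv in recent pandas releases, so on a current install the same idea can be written by squeezing the one-column result instead (column index 4 is kept from the examples above):

import pandas as pd

# .squeeze("columns") turns the single-column DataFrame into a Series,
# which is what squeeze=True used to do inside read_csv.
col = pd.read_csv(infile, sep="\t", engine="c", usecols=[4]).squeeze("columns")
sample_names = sorted(set(col))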
