Read Large Csv File With Many Duplicate Values, Drop Duplicates While Reading
I have the following pandas code snippet that reads all the values found in a specific column of my .csv file.

sample_names_duplicates = pd.read_csv(infile, sep='\t',
Solution 1:
Not "on the fly", although drop_duplicates
should be fast enough for most needs.
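As a minimal sketch of that approach (assuming a tab-separated file and that the target column sits at index 4, as the question's snippet suggests), you would read just that column and drop duplicates after loading:

import pandas as pd

# Assumptions: 'yourfile.csv' is tab-separated and the column of interest is at index 4.
col = pd.read_csv('yourfile.csv', sep='\t', usecols=[4]).squeeze('columns')
sample_names = col.drop_duplicates()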
If you want to do this on the fly, you'll have to manually track duplicates on the particular column:
import csv

seen = set()        # a set gives O(1) membership checks (a list also works, but is slower)
dup_scan_col = 3    # index of the column to de-duplicate on
uniques = []

with open('yourfile.csv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if row[dup_scan_col] not in seen:
            uniques.append(row)
            seen.add(row[dup_scan_col])
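For very large files, the same manual tracking can be combined with pandas' chunked reading so that only one chunk sits in memory at a time. This is a sketch rather than part of the original answer; the chunk size and the column index 4 are placeholders:

import pandas as pd

unique_parts = []
seen = set()

# Assumption: tab-separated file, de-duplicating on the column at index 4.
for chunk in pd.read_csv('yourfile.csv', sep='\t', usecols=[4], chunksize=100_000):
    col = chunk.iloc[:, 0]
    # keep only values not seen in any earlier chunk, de-duplicated within this chunk
    new_values = col[~col.isin(seen)].drop_duplicates()
    unique_parts.append(new_values)
    seen.update(new_values)

sample_names = pd.concat(unique_parts, ignore_index=True)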
Solution 2:
As the result returned by read_csv() is iterable, you can simply wrap it in a set() call to remove duplicates. Note that using a set will lose any ordering you may have. If you then want the values sorted, convert back with list() and call sort().
Unique unordered set example:
sample_names_duplicates = set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True))
Ordered list example:
sample_names = list(set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True)))
sample_names.sort()
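In recent pandas releases the squeeze keyword of read_csv has been removed; if that applies to your version, the same result can be obtained by squeezing the one-column DataFrame afterwards. A sketch, with infile standing in for the path used in the question:

import pandas as pd

infile = 'yourfile.csv'   # path placeholder, as in the question

# Select the single column, squeeze the one-column DataFrame into a Series,
# then deduplicate with set() and sort into a list.
sample_names = sorted(
    set(pd.read_csv(infile, sep='\t', engine='c', usecols=[4]).squeeze('columns'))
)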