Parsing Text Into Pandas Using Regular Expressions, But Empty Dataframe Produced
I am trying to parse the following data into pandas from a text file: genome Bacteroidetes_4 reference B650 source carotenoid genome Desulfovibrio_3 reference B123 source Polyketi
Solution 1:
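The parse_file function needs a few logic corrections.
The changes are:
- Case sensitivity: 'Source' vs 'source'. The group name and the dictionary key must be lowercase to match the data.
- Moved data.append(row), together with the row dictionary, to right after the source is read.
- readline() vs readlines(): readline() reads a single line and returns it as a string, so the for loop was iterating over characters, which was not the intent here.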
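The readline() vs readlines() difference is easy to check in isolation. Here is a minimal sketch, assuming an in-memory io.StringIO object as a stand-in for the real text file (the sample lines are taken from the question's data):

```python
import io

# In-memory stand-in for the text file (an assumption for illustration).
sample = "genome Bacteroidetes_4\nreference B650\nsource carotenoid\n"

# readline() returns ONE string, so a for loop over it walks characters.
first_line = io.StringIO(sample).readline()
chars = [c for c in first_line]

# readlines() returns a LIST of strings, so a for loop walks whole lines.
lines = io.StringIO(sample).readlines()

print(chars[:6])
print(lines)
```

With readline(), each "line" handed to the regex is a single character, so nothing ever matches and no row is built.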
Corrected function:
import pandas as pd

def parse_file(filepath):
    data = []
    # open the file and read through it line by line
    with open(filepath, 'r') as file_object:
        lines = file_object.readlines()  # change 1: readlines() instead of readline()
        for line in lines:
            # at each line check for a match with a regex (parse_line comes from the question)
            key, match = parse_line(line)
            # extract from each line
            if key == 'genome':
                genome = match.group('genome')
            if key == 'source':
                Source = match.group('source')  # change 4: typo, the 's' needs to be lowercase
                # change 3: the while loop (while line.strip():) is commented out
                # and the row is built here, after source
                row = {
                    'genome': genome,
                    'reference': reference,
                    'source': Source,  # change 5: lowercase 's' in the header
                }
                data.append(row)
            if key == 'reference':
                reference = match.group('reference')  # change 2
    data = pd.DataFrame(data)
    return data
Output:
genome,reference,source
Bacteroidetes_4,B650,carotenoid
Desulfovibrio_3,B123,Polyketide
Desulfovibrio_3,B839,flexirubin
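The question's parse_line helper is not reproduced on this page, so Solution 1 cannot be run as shown. The sketch below fills that gap with a hypothetical rx_dict and parse_line (both are assumptions, not the asker's code). It works on a list of lines instead of a file, and it also shows why a capitalised 'Source' pattern never matches the lowercase data:

```python
import re

# Hypothetical patterns; the original parse_line from the question is not shown.
rx_dict = {
    "genome": re.compile(r"^\s*genome\s+(?P<genome>\S+)"),
    "reference": re.compile(r"^\s*reference\s+(?P<reference>\S+)"),
    "source": re.compile(r"^\s*source\s+(?P<source>\S+)"),
}

def parse_line(line):
    """Return (key, match) for the first pattern that matches, else (None, None)."""
    for key, rx in rx_dict.items():
        match = rx.match(line)
        if match:
            return key, match
    return None, None

def parse_lines(lines):
    """Collect one row per 'source' line, reusing the latest genome and reference."""
    rows = []
    genome = reference = None
    for line in lines:
        key, match = parse_line(line)
        if key == "genome":
            genome = match.group("genome")
        elif key == "reference":
            reference = match.group("reference")
        elif key == "source":
            rows.append({"genome": genome,
                         "reference": reference,
                         "source": match.group("source")})
    return rows

# Sample lines taken from the question; the last record has no genome line.
sample_lines = [
    "genome Bacteroidetes_4", "reference B650", "source carotenoid",
    "genome Desulfovibrio_3", "reference B123", "source Polyketide",
    "reference B839", "source flexirubin",
]
rows = parse_lines(sample_lines)

# Regexes are case sensitive by default, so 'Source' never matches 'source'.
capitalised_hit = re.compile(r"Source").search("source carotenoid")
```

Passing rows to pd.DataFrame(rows) should then give the three-row frame shown in the output above.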
Solution 2:
This doesn't quite get all the way there, but I think you can avoid regex.
I edited your original data to be as follows (NB the addition of genome Desulfovibrio_3 three lines from the bottom) and added it to a text file. I can't think of a nice way to do this addition programmatically. You can of course loop through the file line by line and, where a record has no 'genome' line ahead of its 'source', add in the most recent one. I'll leave this to you.
genome Bacteroidetes_4
reference B650
source carotenoid
genome Desulfovibrio_3
reference B123
source Polyketide
genome Desulfovibrio_3
reference B839
source flexirubin
You can then use pandas to read said file as follows:
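import pandas as pd
# read the file, then transpose so that all values sit in a single row
data = pd.read_csv('./sample_data.txt', sep=" ", header='infer', names=['data']).T
# reshape into rows of three values: genome, reference, source
df1 = pd.DataFrame(data.values.reshape(-1, 3), columns=['genome','reference','source'])
df1.to_csv('./output.csv', index=False)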
Source for the second line: How to slice a row with duplicate column names and stack that rows in order
Output:
genome,reference,source
Bacteroidetes_4,B650,carotenoid
Desulfovibrio_3,B123,Polyketide
Desulfovibrio_3,B839,flexirubin
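The manual edit in Solution 2 (adding the missing genome line) can also be done in code. Below is a minimal sketch, assuming every record has a 'reference' line and ends with a 'source' line. The function name fill_missing_genome is made up for this example:

```python
def fill_missing_genome(lines):
    """Return 'key value' lines in which every record has a genome line.

    Assumes each record contains a reference and ends with a source line.
    """
    filled = []
    record = {}
    last_genome = None
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts[0], parts[1].strip()
        record[key] = value
        if key == "genome":
            last_genome = value
        if key == "source":
            # record is complete; reuse the most recent genome if none was given
            record.setdefault("genome", last_genome)
            for field in ("genome", "reference", "source"):
                filled.append(f"{field} {record[field]}")
            record = {}
    return filled

# Original data from the question: the third record has no genome line.
raw_lines = [
    "genome Bacteroidetes_4", "reference B650", "source carotenoid",
    "genome Desulfovibrio_3", "reference B123", "source Polyketide",
    "reference B839", "source flexirubin",
]
filled = fill_missing_genome(raw_lines)
```

Writing "\n".join(filled) to sample_data.txt should then produce the nine-line file that the read_csv and reshape steps above expect.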