
Parsing Text Into Pandas Using Regular Expressions, But Empty Dataframe Produced

I am trying to parse the following data into pandas from a text file: genome Bacteroidetes_4 reference B650 source carotenoid genome Desulfovibrio_3 reference B123 source Polyketi

Solution 1:

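The parse_file function needs a few logical corrections.

The changes are:

  1. Case sensitivity: 'Source' vs 'source'. The group name and the dictionary key must both use a lowercase 's'.
  2. data.append(row) was moved so that a row is appended as soon as a 'source' line is matched.
  3. readline() vs readlines(): readline() reads a single line, so the for loop was iterating over the characters of that line, which was not the intent here.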
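The third point is easy to check in isolation. The snippet below is a small sketch that uses io.StringIO as a stand-in for the opened file (the text variable is made up for illustration, based on the data in the question):

```python
import io

# Illustrative stand-in for the contents of the text file.
text = "genome Bacteroidetes_4\nreference B650\nsource carotenoid\n"

# readline() returns one string, so looping over it yields single characters.
first_line = io.StringIO(text).readline()
chars = [c for c in first_line]

# readlines() returns a list of strings, so looping over it yields whole lines.
lines = io.StringIO(text).readlines()

print(chars[:6])
print(lines)
```

Iterating over the file object directly (for line in file_object) also yields whole lines and avoids loading the entire file into a list.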
import pandas as pd

def parse_file(filepath):
    data = []
    genome = reference = None
    # open the file and read through it line by line
    with open(filepath, 'r') as file_object:
        lines = file_object.readlines()  # change 3: readlines() returns a list of lines
        for line in lines:
            # at each line check for a match with a regex
            # (parse_line is the helper defined in the question)
            key, match = parse_line(line)

            # extract from each line
            if key == 'genome':
                genome = match.group('genome')
            if key == 'reference':
                reference = match.group('reference')
            if key == 'source':
                source = match.group('source')  # change 1: lowercase 's'
                # change 2: the row is built and appended here, once 'source'
                # is matched; the original while loop is no longer needed
                row = {
                    'genome': genome,
                    'reference': reference,
                    'source': source,  # change 1: lowercase 's' in the header
                }
                data.append(row)

    data = pd.DataFrame(data)
    return data

Output:

genome,reference,source
Bacteroidetes_4,B650,carotenoid
Desulfovibrio_3,B123,Polyketide
Desulfovibrio_3,B839,flexirubin
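
The function in this answer relies on the parse_line helper from the question, which is not reproduced in this post. As a minimal sketch, assuming each line holds one keyword followed by one value, the approach can be run end to end as below. RX_DICT, parse_line, and parse_lines are hypothetical stand-ins, not the asker's code, and the placement of the blank line in the sample is a guess.

```python
import re
import pandas as pd

# Hypothetical patterns; the question's own regex dictionary is not shown here.
RX_DICT = {
    "genome": re.compile(r"^genome\s+(?P<genome>\S+)"),
    "reference": re.compile(r"^reference\s+(?P<reference>\S+)"),
    "source": re.compile(r"^source\s+(?P<source>\S+)"),
}

def parse_line(line):
    """Return (key, match) for the first pattern that matches, else (None, None)."""
    for key, rx in RX_DICT.items():
        match = rx.search(line)
        if match:
            return key, match
    return None, None

def parse_lines(lines):
    """Build one row per 'source' line, reusing the latest genome and reference."""
    rows = []
    genome = reference = None
    for line in lines:
        key, match = parse_line(line.strip())
        if key == "genome":
            genome = match.group("genome")
        elif key == "reference":
            reference = match.group("reference")
        elif key == "source":
            rows.append({
                "genome": genome,
                "reference": reference,
                "source": match.group("source"),
            })
    return pd.DataFrame(rows, columns=["genome", "reference", "source"])

# Sample based on the data in this post; the last record has no genome line.
sample = """genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin
"""

df = parse_lines(sample.splitlines())
print(df.to_csv(index=False))
```

Because genome is only overwritten when a new 'genome' line appears, the last record inherits Desulfovibrio_3 without any change to the input file.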

Solution 2:

This is not a complete solution, but I think you can avoid regex.

I edited your original data as follows (note the added line genome Desulfovibrio_3, three lines from the bottom) and saved it to a text file. I can't think of a nice way to make this addition programmatically. You could of course loop through the file line by line and, wherever a 'source' has no 'genome' line of its own before it, add in the most recent one. I'll leave this to you.

genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
genome Desulfovibrio_3
reference B839
source flexirubin

You can then use pandas to read said file as follows:

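import pandas as pd

# each line is "key value"; with only one column name given, the key becomes the index
data = pd.read_csv('./sample_data.txt', sep=" ", header='infer', names=['data']).T
# fold the single row of values into rows of three (genome, reference, source)
df1 = pd.DataFrame(data.values.reshape(-1, 3), columns=['genome','reference','source'])
df1.to_csv('./output.csv', index=False)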

Source for the second line: How to slice a row with duplicate column names and stack that rows in order

Output:

genome,reference,source
Bacteroidetes_4,B650,carotenoid
Desulfovibrio_3,B123,Polyketide
Desulfovibrio_3,B839,flexirubin
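
The fill-in step that this answer leaves as an exercise can also be done without a manual loop. The sketch below is one possible approach, not the answerer's method. It assumes every non-blank line is a "key value" pair separated by a single space, reads the unedited data, and carries the most recent genome forward with ffill. The names raw, pairs, and wide are illustrative.

```python
import io
import pandas as pd

# Unedited data: the last record has no 'genome' line of its own.
raw = """genome Bacteroidetes_4
reference B650
source carotenoid

genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin
"""

# One row per non-blank line, split into key and value.
pairs = pd.read_csv(io.StringIO(raw), sep=" ", header=None, names=["key", "value"])

# One column per key; a cell is filled only on the line where that key occurs.
keys = ["genome", "reference", "source"]
wide = pd.DataFrame({k: pairs["value"].where(pairs["key"] == k) for k in keys})

# Carry the latest genome and reference down to the following lines.
wide = wide.ffill()

# A record is complete on its 'source' line, so keep only those rows.
df = wide.loc[pairs["key"] == "source"].reset_index(drop=True)
print(df.to_csv(index=False))
```

Note that ffill carries the reference forward as well, so a record with a missing 'reference' line would silently inherit the previous one. If that can happen in the real file, the manual loop gives more control.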
