
Is Grouping In Dataframe Based On Specific Parameters Possible Using Python?

When you have a large data set in Excel (xlsx, csv, or xls) with certain repeating values that you need to select, how do you do it? That's like a very vague and broad way

Solution 1:

I would do it this way:

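import pandas as pd

# Stack both frames, then keep only names that appear with more than one distinct 'No.'
df_out = pd.concat([df1,df2])
df_out = (df_out[df_out.groupby(['Name'])['No.'].transform(lambda x: x.nunique() > 1)]
              .reset_index(drop=True)
              .set_index(['Name','No.'], append=True)['Comment']
              .unstack([0,2]))
# Drop the row-number level from the column header
df_out.columns = df_out.columns.droplevel(0)
df_out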

Output:

No.      2139300     2234903  2139300  2234903
Name                                          
John  Irrelevant  Regardless  Awesome  Perfect

Use reset_index to give each row a unique index, then append 'Name' and 'No.' to that index. Unstacking the new row-number level together with 'No.' creates a MultiIndex column header; finally, drop the top level of that header.
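For anyone who wants to run this end to end, here is a minimal sketch of the same steps. The contents of df1 and df2 are an assumption, reconstructed from the outputs shown in this thread, and transform("nunique") is used in place of the lambda; the intermediate names (stacked, keep, filtered, wide) are only for illustration.

```python
import pandas as pd

# Hypothetical sample data, reconstructed from the outputs shown in this thread.
df1 = pd.DataFrame({
    "Name": ["Bob", "Bob", "Joe", "John", "John"],
    "No.": [2123320, 2123320, 2832883, 2139300, 2234903],
    "Comment": ["Doesnt Matter", "Something", "Whatever", "Irrelevant", "Regardless"],
})
df2 = pd.DataFrame({
    "Name": ["Bob", "Bob", "Joe", "John", "John"],
    "No.": [2123320, 2123320, 2832883, 2139300, 2234903],
    "Comment": ["Great", "Good", "Solid", "Awesome", "Perfect"],
})

stacked = pd.concat([df1, df2])

# Keep only names that occur with more than one distinct 'No.'.
keep = stacked.groupby("Name")["No."].transform("nunique") > 1
filtered = stacked[keep].reset_index(drop=True)  # fresh row numbers 0..n-1

# Index levels are now (row number, Name, No.); unstack levels 0 and 2.
wide = filtered.set_index(["Name", "No."], append=True)["Comment"].unstack([0, 2])
wide.columns = wide.columns.droplevel(0)  # drop the row-number level
print(wide)
```

The reset_index step matters because the row number is what keeps the two 2139300 entries apart; without a unique level, unstack would see duplicate index entries.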

To get rid of the index names and produce a cleaner-looking table, you can use:

df_out.rename_axis(None, axis=1).rename_axis(None)

Result:

         2139300     2234903  2139300  2234903
John  Irrelevant  Regardless  Awesome  Perfect
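A small sketch of what the two rename_axis calls change, using a hypothetical one-row frame (df and clean are illustrative names, not from the original answer). The first call clears the name of the column axis, the second clears the name of the row index, and the original frame is left untouched.

```python
import pandas as pd

# Hypothetical one-row frame with named row and column axes.
df = pd.DataFrame(
    [["Irrelevant", "Regardless"]],
    index=pd.Index(["John"], name="Name"),
    columns=pd.Index([2139300, 2234903], name="No."),
)

# Clear the column-axis name, then the row-index name.
clean = df.rename_axis(None, axis=1).rename_axis(None)

print(df.index.name, df.columns.name)        # Name No.
print(clean.index.name, clean.columns.name)  # None None
```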

Solution 2:

How about this?

1) Group & unstack dataframe1 and dataframe2 to get the general shape you're going for:

dataframe1_transformed = \
    dataframe1.groupby(["**Name**", '**No.**'])['**Comment**'].\
    sum().unstack("**No.**")

dataframe2_transformed = \
    dataframe2.groupby(["**Name**", '**No.**'])['**Comment**'].\
    sum().unstack("**No.**")

dataframe1_transformed

**No.** **Name**  2123320                 2139300     2234903     2832883
0       Bob       Doesnt MatterSomething  None        None        None
1       Joe       None                    None        None        Whatever
2       John      None                    Irrelevant  Regardless  None

dataframe2_transformed

**No.** **Name**  2123320    2139300  2234903  2832883
0       Bob       GreatGood  None     None     None
1       Joe       None       None     None     Solid
2       John      None       Awesome  Perfect  None
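Step 1 can be tried on a small frame like the one below. This is only a sketch: the sample rows and the plain column names ('Name', 'No.', 'Comment') are assumptions, and agg("".join) is used as an explicit stand-in for sum(), which concatenates string comments in the same way (that is where values like "GreatGood" come from).

```python
import pandas as pd

# Hypothetical sample data; the plain column names are an assumption.
dataframe1 = pd.DataFrame({
    "Name": ["Bob", "Bob", "Joe", "John", "John"],
    "No.": [2123320, 2123320, 2832883, 2139300, 2234903],
    "Comment": ["Doesnt Matter", "Something", "Whatever", "Irrelevant", "Regardless"],
})

# One entry per (Name, No.) pair, with that pair's comments joined together.
joined = dataframe1.groupby(["Name", "No."])["Comment"].agg("".join)

# Move 'No.' from the row index to the columns: one row per Name.
transformed = joined.unstack("No.")
print(transformed)
```

In this sketch the rows are labelled by Name, and any (Name, No.) combination that never occurs shows up as a missing value.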

2) Combine them:

dataframe_all_transformed = \
    dataframe1_transformed.merge(dataframe2_transformed, 
                                 how='inner', left_index=True,
                                 right_index=True)

dataframe_all_transformed

**No.** **Name**  2123320_x              2139300_x   2234903_x   2832883_x  2123320_y  2139300_y  2234903_y  2832883_y
0       Bob       DoesntMatterSomething  None        None        None       GreatGood  None       None       None
1       Joe       None                   None        None        Whatever   None       None       None       Solid
2       John      None                   Irrelevant  Regardless  None       None       Awesome    Perfect    None
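The _x and _y suffixes come from merge itself: when both frames have columns with the same label, pandas appends its default suffixes to tell them apart. A minimal sketch with hypothetical frames (left, right, and the name "Ann" are made up for illustration; column labels are strings here):

```python
import pandas as pd

# Hypothetical wide frames indexed by name; column labels are strings here.
left = pd.DataFrame(
    {"2139300": ["Irrelevant", None], "2832883": [None, "Whatever"]},
    index=pd.Index(["John", "Joe"], name="Name"),
)
right = pd.DataFrame(
    {"2139300": ["Awesome", None], "2832883": [None, "Solid"]},
    index=pd.Index(["John", "Ann"], name="Name"),
)

# Join on the row index; the inner join keeps only names present in both frames.
merged = left.merge(right, how="inner", left_index=True, right_index=True)

print(list(merged.columns))  # ['2139300_x', '2832883_x', '2139300_y', '2832883_y']
print(list(merged.index))    # ['John']
```

Because the join is an inner one, a name that appears in only one of the two frames is dropped from the result.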

3) Separately count the number of unique appearances:

num_appearances = dataframe1.drop_duplicates(subset=['**Name**', '**No.**']).\
    groupby(['**Name**']).size()

multiple_appearing_names = num_appearances[num_appearances >1].index
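To see what this count produces, here is a sketch on hypothetical data with plain column names (an assumption). It also shows groupby(...).nunique() as a shortcut, which should give the same counts as long as 'No.' has no missing values.

```python
import pandas as pd

# Hypothetical sample data with a repeated (Name, No.) pair for Bob.
dataframe1 = pd.DataFrame({
    "Name": ["Bob", "Bob", "Joe", "John", "John"],
    "No.": [2123320, 2123320, 2832883, 2139300, 2234903],
})

# Approach from the step above: de-duplicate pairs, then count rows per name.
by_size = dataframe1.drop_duplicates(subset=["Name", "No."]).groupby("Name").size()

# Shortcut: count distinct 'No.' values per name directly.
by_nunique = dataframe1.groupby("Name")["No."].nunique()

print(by_size.to_dict())                 # {'Bob': 1, 'Joe': 1, 'John': 2}
print(list(by_size[by_size > 1].index))  # ['John']
```

Bob has two rows but only one distinct number, so only John passes the "more than one" filter.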

4) Filter the combined transformed data just for those names:

dataframe_multiple_transformed = dataframe_all_transformed.loc[
    multiple_appearing_names].T.dropna().T
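The .T.dropna().T idiom transposes so that columns become rows, drops every row with a missing value, and transposes back; the net effect is to drop columns that are empty for the selected names. A sketch on a hypothetical merged frame, with dropna(axis=1) shown as a more direct spelling:

```python
import pandas as pd

# Hypothetical merged frame; None marks a missing comment.
merged = pd.DataFrame(
    {
        "2123320_x": ["DoesntMatterSomething", None],
        "2139300_x": [None, "Irrelevant"],
        "2139300_y": [None, "Awesome"],
    },
    index=pd.Index(["Bob", "John"], name="Name"),
)
names = pd.Index(["John"], name="Name")

# Select the wanted rows, then drop every column that has a missing value.
via_transpose = merged.loc[names].T.dropna().T
via_axis = merged.loc[names].dropna(axis=1)

print(list(via_transpose.columns))  # ['2139300_x', '2139300_y']
```

Note that with several selected names, a column is dropped as soon as any one of them has a missing value in it.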

5) Technically it's a bad idea to have identical column names in a dataframe, but since you want it:

dataframe_multiple_transformed.columns = \
    [x.split("_")[0] for x in dataframe_multiple_transformed.columns]

dataframe_multiple_transformed

    **Name**    2139300     2234903     2139300 2234903
0   John        Irrelevant  Regardless  Awesome Perfect
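One concrete reason duplicate column names are awkward: selecting a duplicated label no longer returns a single column. A short sketch on a hypothetical frame shaped like the result above:

```python
import pandas as pd

# Hypothetical frame with suffixed column labels, as produced by a merge.
df = pd.DataFrame(
    [["Irrelevant", "Regardless", "Awesome", "Perfect"]],
    index=["John"],
    columns=["2139300_x", "2234903_x", "2139300_y", "2234903_y"],
)

# Strip the merge suffix; this leaves two columns per label.
df.columns = [label.split("_")[0] for label in df.columns]

print(df.columns.is_unique)  # False
# Selecting a duplicated label returns a DataFrame, not a Series.
print(df["2139300"].shape)   # (1, 2)
```

If the frame is only for display this may be fine, but code that expects df["2139300"] to be a Series will need to handle the DataFrame case.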
