Tuesday, January 16, 2024

[SOLVED] Codes to add multiple new columns based on row strings of a column from one file matched to that present in multiple columns of another file

January 16, 2024 bash, python, r

Issue

I'm very basic to writing complex codes for a situation like the one here. I have 2 files with different number of columns and rows: File_1.csv & File_2.csv. The 3rd column 'Structure' in File_2.csv has combined row strings from 5th 'Structure_1', 6th 'Structure_2', 7th 'Structure_3' columns of File_1.csv. File_1.csv can have multiple rows with same row strings of those three 'Structure_1', 'Structure_2', 'Structure_3' columns, while File_2.csv 'Structure' column only have the row strings occuring once. Now I would like to use the 4th column 'Barcodes' of File_2.csv to the File_1 every time it matches with the 'Structure' column of File_2.csv, and add three new columns named 'Barcode_1', 'Barcode_2', 'Barcode_3' at the end of File_1.csv, which would correspond to their 'Structure_1', 'Structure_2', 'Structure_3' strings and their position in File_1.csv. Since the row strings of 'Structure_1', 'Structure_2', 'Structure_3' in File_1.csv can be present multiple times it can be the case that the row strings for 'Barcode_1', 'Barcode_2', 'Barcode_3' are added more than once to File_1.csv as well. Please find the 2 files & the example File_3_output.csv file.

File_1.csv File_2.csv File_3_output.csv Any Bash or Python or R codes would be greatly appreciated.

Solution

You can directly read a .CSV file into a pandas dataframe using pd.read_csv. In the code below I created two sample dataframes as I do not have your .CSV files. There are numerous ways to do the mapping (such as by merging the dataframes) but code below creates and then uses a mapping dictionary. This should be enough to get you going. The result dataframe can then be written out as a .CSV.

import pandas as pd

df1 = pd.DataFrame({"struct1": ['CDAB', 'ACEG', 'KLMN'],
                   "struct2": ['ACEG', 'XAYBZ', 'ACEG']
                   })

df2 = pd.DataFrame({"struct": ['CDAB', 'ACEG', 'KLMN', 'XAYBZ'],
                    "barcodes": ['111', '222', '333', '444']
                    })

print(df1)
print(df2)

d = dict(df2[['struct', 'barcodes']].values)

df1['barcodes_1']= df1['struct1'].map(d)
df1['barcodes_2']= df1['struct2'].map(d)

print(df1)

which gives

  struct1 struct2
0    CDAB    ACEG
1    ACEG   XAYBZ
2    KLMN    ACEG

  struct barcodes
0   CDAB      111
1   ACEG      222
2   KLMN      333
3  XAYBZ      444

  struct1 struct2 barcodes_1 barcodes_2
0    CDAB    ACEG        111        222
1    ACEG   XAYBZ        222        444
2    KLMN    ACEG        333        222

Answered By - user19077881

Answer Checked By - Robin (WPSolving Admin)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, January 16, 2024

[SOLVED] Codes to add multiple new columns based on row strings of a column from one file matched to that present in multiple columns of another file

Issue

Solution

Popular Posts

Labels