Issue
I'm very basic to writing complex codes for a situation like the one here. I have 2 files with different number of columns and rows: File_1.csv & File_2.csv. The 3rd column 'Structure' in File_2.csv has combined row strings from 5th 'Structure_1', 6th 'Structure_2', 7th 'Structure_3' columns of File_1.csv. File_1.csv can have multiple rows with same row strings of those three 'Structure_1', 'Structure_2', 'Structure_3' columns, while File_2.csv 'Structure' column only have the row strings occuring once. Now I would like to use the 4th column 'Barcodes' of File_2.csv to the File_1 every time it matches with the 'Structure' column of File_2.csv, and add three new columns named 'Barcode_1', 'Barcode_2', 'Barcode_3' at the end of File_1.csv, which would correspond to their 'Structure_1', 'Structure_2', 'Structure_3' strings and their position in File_1.csv. Since the row strings of 'Structure_1', 'Structure_2', 'Structure_3' in File_1.csv can be present multiple times it can be the case that the row strings for 'Barcode_1', 'Barcode_2', 'Barcode_3' are added more than once to File_1.csv as well. Please find the 2 files & the example File_3_output.csv file.
File_1.csv File_2.csv File_3_output.csv Any Bash or Python or R codes would be greatly appreciated.
Solution
You can directly read a .CSV file into a pandas dataframe using pd.read_csv. In the code below I created two sample dataframes as I do not have your .CSV files. There are numerous ways to do the mapping (such as by merging the dataframes) but code below creates and then uses a mapping dictionary. This should be enough to get you going. The result dataframe can then be written out as a .CSV.
import pandas as pd
df1 = pd.DataFrame({"struct1": ['CDAB', 'ACEG', 'KLMN'],
"struct2": ['ACEG', 'XAYBZ', 'ACEG']
})
df2 = pd.DataFrame({"struct": ['CDAB', 'ACEG', 'KLMN', 'XAYBZ'],
"barcodes": ['111', '222', '333', '444']
})
print(df1)
print(df2)
d = dict(df2[['struct', 'barcodes']].values)
df1['barcodes_1']= df1['struct1'].map(d)
df1['barcodes_2']= df1['struct2'].map(d)
print(df1)
which gives
struct1 struct2
0 CDAB ACEG
1 ACEG XAYBZ
2 KLMN ACEG
struct barcodes
0 CDAB 111
1 ACEG 222
2 KLMN 333
3 XAYBZ 444
struct1 struct2 barcodes_1 barcodes_2
0 CDAB ACEG 111 222
1 ACEG XAYBZ 222 444
2 KLMN ACEG 333 222
Answered By - user19077881 Answer Checked By - Robin (WPSolving Admin)