Issue
I am trying to split the data and rearrange the data in a CSV file. My data looks something like this
1:100011159-T-G,CDD3-597,G,G
1:10002775-GA,CDD3-597,G,G
1:100122796-C-T,CDD3-597,T,T
1:100152282-CAAA-T,CDD3-597,C,C
1:100011159-T-G,CDD3-598,G,G
1:100152282-CAAA-T,CDD3-598,C,C
and I want a table that looks like this:
ID | 1:100011159-T-G | 1:10002775-GA | 1:100122796-C-T | 1:100152282-CAAA-T |
---|---|---|---|---|
CDD3-597 | GG | GG | TT | CC |
CDD3-598 | GG | CC |
I have written the following code:
import pandas as pd
input_file = "trail_berry.csv"
output_file = "trail_output_result.csv"
# Read the CSV file without header
df = pd.read_csv(input_file, header=None)
print(df[0].str.split(',', n=2, expand=True))
# Extract SNP Name, ID, and Alleles from the data
df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)
# Create a new DataFrame with unique SNP_Name values as columns
result_df = pd.DataFrame(columns=df['SNP_Name'].unique(), dtype=str)
# Populate the new DataFrame with ID and Alleles data
for _, row in df.iterrows():
result_df.at[row['ID'], row['SNP_Name']] = row['Alleles']
# Reset the index
result_df.reset_index(inplace=True)
result_df.rename(columns={'index': 'ID'}, inplace=True)
# Fill NaN values with an appropriate representation (e.g., 'NULL' or '')
result_df = result_df.fillna('NULL')
# Save the result to a new CSV file
result_df.to_csv(output_file, index=False)
# Print a message indicating that the file has been saved
print("Result has been saved to {}".format(output_file))
but this has been giving me the following error:
Traceback (most recent call last):
File "berry_trail.py", line 11, in <module>
df[['SNP_Name', 'ID', 'Alleles']] = df[0].str.split(',', n=-1, expand=True)
File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3367, in __setitem__
self._setitem_array(key, value)
File "/nas/longleaf/home/svennam/.local/lib/python3.5/site-packages/pandas/core/frame.py", line 3389, in _setitem_array
raise ValueError('Columns must be same length as key')
Can someone please help, I am having hard time figuring this out.Thanks in advance!
ValueError: Columns must be same length as key
Solution
What are you trying to do is called pivoting, so use DataFrame.pivot()
:
import pandas as pd
df = pd.read_csv("your_file.csv", header=None)
out = (
df.pivot(index=1, columns=0, values=2)
.rename_axis(index="ID", columns=None)
.reset_index()
)
print(out)
Prints:
ID 1:100011159-T-G 1:10002775-GA 1:100122796-C-T 1:100152282-CAAA-T
0 CDD3-597 GG GG TT CC
1 CDD3-598 GG NaN NaN CC
EDIT: With your updated input:
import pandas as pd
df = pd.read_csv("your_file.csv", header=None)
df["tmp"] = df[[2, 3]].agg("".join, axis=1) # <-- join the two columns together
out = (
df.pivot(index=1, columns=0, values="tmp")
.rename_axis(index="ID", columns=None)
.reset_index()
)
print(out)
Answered By - Andrej Kesely Answer Checked By - Clifford M. (WPSolving Volunteer)