Merge multiple DataFrames Pandas
This might be considered a duplicate of a thorough explanation of various approaches; however, I can't seem to find a solution to my problem there because I have a larger number of DataFrames.
I have multiple DataFrames (more than 10), each differing in one column, VARX. This is just a quick and oversimplified example:
import pandas as pd
df1 = pd.DataFrame({'depth': [0.500000, 0.600000, 1.300000],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1', 'profile_1', 'profile_1']})
df2 = pd.DataFrame({'depth': [0.600000, 1.100000, 1.200000],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1', 'profile_1', 'profile_1']})
df3 = pd.DataFrame({'depth': [1.200000, 1.300000, 1.400000],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1', 'profile_1', 'profile_1']})
Each df has the same or different depths for the same profiles, so I need to create a new DataFrame that merges all the separate ones, where the key columns for the operation are depth and profile, with all appearing depth values for each profile.
The VARX value should therefore be NaN where there is no depth measurement of that variable for that profile.
The result should thus be a new, compressed DataFrame with all VARX columns added to the depth and profile ones, something like this:
name_profile depth VAR1 VAR2 VAR3
profile_1 0.500000 38.196202 NaN NaN
profile_1 0.600000 38.198002 0.20440 NaN
profile_1 1.100000 NaN 0.20442 NaN
profile_1 1.200000 NaN 0.20446 15.1880
profile_1 1.300000 38.200001 NaN 15.1820
profile_1 1.400000 NaN NaN 15.1820
Note that the actual number of profiles is much, much bigger.
Any ideas?
python pandas dataframe
asked 16 hours ago by PEBKAC, edited 12 hours ago
5 Answers
Consider setting the index on each data frame and then running the horizontal merge with pd.concat:
dfs = [df.set_index(['profile', 'depth']) for df in [df1, df2, df3]]
print(pd.concat(dfs, axis=1).reset_index())
#      profile  depth       VAR1     VAR2    VAR3
# 0  profile_1    0.5  38.196202      NaN     NaN
# 1  profile_1    0.6  38.198002  0.20440     NaN
# 2  profile_1    1.1        NaN  0.20442     NaN
# 3  profile_1    1.2        NaN  0.20446  15.188
# 4  profile_1    1.3  38.200001      NaN  15.182
# 5  profile_1    1.4        NaN      NaN  15.182
that's awesome, thank you! How would you do it within a loop, for example: for m in range(len(myfiles)): (where I read separate files for each df) df = pd.read_csv(myfiles[m])
– PEBKAC, 15 hours ago
Ah, my mistake, do not bracket m, which casts it as a list: dfs = [pd.read_csv(m, index_col=[0,1]) for m in myfiles]
– Parfait, 15 hours ago
You have multiple rows with the same profile AND depth. Originally you had that same issue in your post, and I noticed you edited the first df's depth from 0.6 to 0.5. Try de-duping or aggregating before setting the index and concatenating.
– Parfait, 14 hours ago
I believe that is a different question, and you already accepted a solution here (which, come to think of it, may result in duplicate joins). Make an earnest effort and come back to SO with specific issues.
– Parfait, 14 hours ago
You should close this one out, as the answers here do resolve your immediate question and even use the posted data. The data size, and even data content with dups, is a different question.
– Parfait, 12 hours ago
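The de-duping suggestion from the comments could look roughly like the sketch below. It assumes that averaging repeated measurements is acceptable (another aggregation may suit the real data better), and the duplicated depth in df1 is made up to mimic the first version of the question.

```python
import pandas as pd

keys = ['profile', 'depth']

# Hypothetical frame with a repeated (profile, depth) pair.
df1 = pd.DataFrame({'depth': [0.6, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})

def dedupe(df):
    # Average all measurements that share the same profile and depth,
    # so every key pair occurs exactly once. The result is already
    # indexed by (profile, depth), so no separate set_index is needed.
    return df.groupby(keys).mean()

dfs = [dedupe(df) for df in (df1, df2)]
merged = pd.concat(dfs, axis=1).reset_index()
print(merged)
```

With unique keys in every frame, the concat along axis=1 aligns rows on the index and cannot multiply them.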
Or using merge:
from functools import partial, reduce
dfs = [df1,df2,df3]
merge = partial(pd.merge, on=['depth','profile'], how='outer')
reduce(merge, dfs)
   depth       VAR1    profile     VAR2    VAR3
0    0.5  38.196202  profile_1      NaN     NaN
1    0.6  38.198002  profile_1  0.20440     NaN
2    1.3  38.200001  profile_1      NaN  15.182
3    1.1        NaN  profile_1  0.20442     NaN
4    1.2        NaN  profile_1  0.20446  15.188
5    1.4        NaN  profile_1      NaN  15.182
Update
For merging the dataframes in a loop as suggested in the comments, you could do something like:
# Start from the first frame rather than an empty one: an empty frame
# built from df1.columns would clash with df1's own VAR1 column.
df_final = dfs[0]
for df in dfs[1:]:
    df_final = df_final.merge(df, on=['depth','profile'], how='outer')
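Since repeated key pairs make an outer merge multiply rows, it may help to let pandas check for them. The sketch below uses the validate parameter of pd.merge; the helper name merge_all and the frame dup are made up for illustration.

```python
from functools import reduce

import pandas as pd

keys = ['depth', 'profile']

def merge_all(dfs):
    # validate='one_to_one' makes pandas raise MergeError when a key pair
    # occurs more than once on either side of a merge.
    return reduce(
        lambda left, right: pd.merge(left, right, on=keys, how='outer',
                                     validate='one_to_one'),
        dfs,
    )

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

merged = merge_all([df1, df2, df3])

# Hypothetical frame in which depth 0.6 appears twice for the same profile.
dup = df1.assign(depth=[0.6, 0.6, 1.3])
try:
    merge_all([dup, df2])
    raised = False
except pd.errors.MergeError:
    raised = True
```

Failing early like this is a design choice; aggregating the duplicates first is the alternative when they are expected.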
that's awesome, thank you! How would you do it within a loop, for example: for m in range(len(myfiles)): (where I read separate files for each df) df = pd.read_csv(myfiles[m])
– PEBKAC, 15 hours ago
Well, the main purpose of reduce here is to avoid a loop. If you prefer that approach, I assume for memory constraints, you need a single merge on each iteration. Simply update the resulting dataframe on each loop.
– yatu, 15 hours ago
thank you, that's super helpful, but would you perhaps care to show what such an iteration would look like, perhaps just here as a comment? I'm not really sure how to continue
– PEBKAC, 15 hours ago
Check the update @PEBKAC
– yatu, 15 hours ago
Well, if you have to end up merging them all, you likely won't be able to obtain the final dataframe anyway. I'd suggest working with chunks of data. Check stackoverflow.com/questions/47386405/…
– yatu, 14 hours ago
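For the file-by-file case raised in the comments, one possible shape of the loop is sketched below. The io.StringIO objects stand in for real file paths, and the CSV layout (profile, depth, then one VARX column) and its values are assumptions made for the example.

```python
import io

import pandas as pd

# Stand-ins for real file paths; pd.read_csv accepts both paths and
# file-like objects. Contents are invented for illustration.
myfiles = [
    io.StringIO("profile,depth,VAR1\nprofile_1,0.5,38.196202\nprofile_1,0.6,38.198002\n"),
    io.StringIO("profile,depth,VAR2\nprofile_1,0.6,0.20440\nprofile_1,1.1,0.20442\n"),
    io.StringIO("profile,depth,VAR3\nprofile_1,1.1,15.1880\nprofile_1,1.4,15.1820\n"),
]

df_final = None
for f in myfiles:
    df = pd.read_csv(f)
    if df_final is None:
        # The first file seeds the result.
        df_final = df
    else:
        # Every later file is outer-merged into the running result, so only
        # the result and one file are held in memory at a time.
        df_final = df_final.merge(df, on=['profile', 'depth'], how='outer')

print(df_final)
```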
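I would use append. (DataFrame.append was removed in pandas 2.0; on current versions, pd.concat([df1, df2, df3]).sort_values('depth') produces the same rows.)
>>> df1.append(df2).append(df3).sort_values('depth')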
VAR1 VAR2 VAR3 depth profile
0 38.196202 NaN NaN 0.5 profile_1
1 38.198002 NaN NaN 0.6 profile_1
0 NaN 0.20440 NaN 0.6 profile_1
1 NaN 0.20442 NaN 1.1 profile_1
2 NaN 0.20446 NaN 1.2 profile_1
0 NaN NaN 15.188 1.2 profile_1
2 38.200001 NaN NaN 1.3 profile_1
1 NaN NaN 15.182 1.3 profile_1
2 NaN NaN 15.182 1.4 profile_1
Obviously if you have a lot of dataframes, just make a list and loop through them.
thank you! @BlivetWidget, how do you sort it both by depth AND profile? Each profile has a set of depths, and each dataframe has a bunch of profiles.
– PEBKAC, 10 hours ago
@PEBKAC you can sort it by however many parameters you want, in whatever order you want: .sort_values(['depth', 'profile']) or .sort_values(['profile', 'depth']). You can check the help on df1.sort_values to learn how to change the sort order, to sort in place, and various other optional parameters.
– BlivetWidget, 10 hours ago
thank you, most helpful!
– PEBKAC, 10 hours ago
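The stacked result above still has one row per measurement. A minimal sketch of collapsing it to one row per profile and depth, and sorting by both keys, is shown below. It uses pd.concat instead of append and relies on groupby(...).first() taking the first non-null value per column, which assumes each variable is measured at most once per key pair.

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Stack the frames vertically; columns missing from a frame become NaN.
stacked = pd.concat([df1, df2, df3], ignore_index=True)

# Collapse rows that share (profile, depth), keeping the first non-null
# value of every VARX column, then sort by both keys.
collapsed = (stacked.groupby(['profile', 'depth'], as_index=False)
                    .first()
                    .sort_values(['profile', 'depth'])
                    .reset_index(drop=True))
print(collapsed)
```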
Why not concatenate all the Data Frames, melt, then reform them using your ids? There might be a more efficient way to do this, but this works.
df=pd.melt(pd.concat([df1,df2,df3]),id_vars=['profile','depth'])
df_pivot=df.pivot_table(index=['profile','depth'],columns='variable',values='value')
where df_pivot will be:
variable VAR1 VAR2 VAR3
profile depth
profile_1 0.5 38.196202 NaN NaN
0.6 38.198002 0.20440 NaN
1.1 NaN 0.20442 NaN
1.2 NaN 0.20446 15.188
1.3 38.200001 NaN 15.182
1.4 NaN NaN 15.182
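If a flat frame like the one in the question is preferred over the MultiIndex result, something along these lines should work. Note that pivot_table aggregates with the mean by default, so repeated measurements for the same profile and depth would be averaged silently. The name long_df is made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Long format: one row per (profile, depth, variable) combination.
long_df = pd.melt(pd.concat([df1, df2, df3]), id_vars=['profile', 'depth'])

# Back to wide format, one column per variable (default aggfunc is mean).
df_pivot = long_df.pivot_table(index=['profile', 'depth'],
                               columns='variable', values='value')

# Turn the index levels back into ordinary columns and drop the
# leftover 'variable' label on the column axis.
flat = df_pivot.reset_index()
flat.columns.name = None
print(flat)
```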
You can also use:
dfs = [df1, df2, df3]
df = pd.merge(dfs[0], dfs[1], left_on=['depth','profile'], right_on=['depth','profile'], how='outer')
for d in dfs[2:]:
df = pd.merge(df, d, left_on=['depth','profile'], right_on=['depth','profile'], how='outer')
depth VAR1 profile VAR2 VAR3
0 0.5 38.196202 profile_1 NaN NaN
1 0.6 38.198002 profile_1 0.20440 NaN
2 1.3 38.200001 profile_1 NaN 15.182
3 1.1 NaN profile_1 0.20442 NaN
4 1.2 NaN profile_1 0.20446 15.188
5 1.4 NaN profile_1 NaN 15.182
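The merged rows come out in merge order and with profile in the middle of the columns. A small follow-up sketch that sorts by profile and depth and moves the key columns to the front, to resemble the layout asked for in the question, might look like this (the names value_cols and tidy are made up):

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

dfs = [df1, df2, df3]
df = dfs[0]
for d in dfs[1:]:
    df = pd.merge(df, d, on=['depth', 'profile'], how='outer')

# Everything that is not a key column is a measured variable.
value_cols = [c for c in df.columns if c not in ('profile', 'depth')]

# Sort by the keys and put them first.
tidy = (df.sort_values(['profile', 'depth'])
          .reset_index(drop=True)[['profile', 'depth'] + value_cols])
print(tidy)
```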
pd.concat answer: answered 15 hours ago by Parfait
merge/reduce answer: answered 15 hours ago by yatu, edited 15 hours ago
answered 15 hours ago by BlivetWidget (edited 15 hours ago)
Thank you, @BlivetWidget! How do you sort by both depth AND profile? Each profile has a set of depths, and each DataFrame has a bunch of profiles.
– PEBKAC
10 hours ago
@PEBKAC You can sort by as many columns as you want, in whatever order you want: .sort_values(['depth', 'profile']) or .sort_values(['profile', 'depth']). Check the help on df1.sort_values to learn how to change the sort order, sort in place, and use the other optional parameters.
– BlivetWidget
10 hours ago
Thank you, most helpful!
– PEBKAC
10 hours ago
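As a small illustration of the multi-column sort described in the comment above, here is a sketch on made-up data. The second profile (profile_2) and all values are hypothetical and only there to show the ordering; sort_values and its ascending parameter are the real pandas API.

```python
import pandas as pd

# Hypothetical data with two profiles, deliberately out of order.
df = pd.DataFrame({'profile': ['profile_2', 'profile_1', 'profile_1', 'profile_2'],
                   'depth': [0.6, 1.3, 0.5, 0.5],
                   'VAR1': [1.0, 2.0, 3.0, 4.0]})

# Sort by profile first, then by depth within each profile.
by_profile = df.sort_values(['profile', 'depth']).reset_index(drop=True)

# Per-column direction: profiles ascending, depths descending.
deepest_first = (df.sort_values(['profile', 'depth'], ascending=[True, False])
                   .reset_index(drop=True))

print(by_profile)
print(deepest_first)
```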
Why not concatenate all the DataFrames, melt them, then pivot them back using your id columns? There might be a more efficient way to do this, but this works.
df = pd.melt(pd.concat([df1, df2, df3]), id_vars=['profile', 'depth'])
df_pivot = df.pivot_table(index=['profile', 'depth'], columns='variable', values='value')
where df_pivot will be:
variable              VAR1     VAR2    VAR3
profile   depth
profile_1 0.5    38.196202      NaN     NaN
          0.6    38.198002  0.20440     NaN
          1.1          NaN  0.20442     NaN
          1.2          NaN  0.20446  15.188
          1.3    38.200001      NaN  15.182
          1.4          NaN      NaN  15.182
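To get from the pivoted result to the flat layout asked for in the question, the index can be turned back into ordinary columns. The sketch below is one way to do that under the assumption of at most one measurement per variable, profile and depth. Note that pivot_table averages duplicate entries by default (aggfunc='mean'), so aggfunc='first' is passed here as my own choice, and the names long_df, wide and flat are illustrative.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Long format: one row per (profile, depth, variable); drop the NaN
# placeholders that concat introduced for the missing variables.
long_df = (pd.concat([df1, df2, df3], sort=False)
             .melt(id_vars=['profile', 'depth'])
             .dropna(subset=['value']))

# Back to wide format, one column per variable.
wide = long_df.pivot_table(index=['profile', 'depth'],
                           columns='variable',
                           values='value',
                           aggfunc='first')

# Flatten the MultiIndex into plain 'profile' and 'depth' columns.
flat = wide.reset_index()
flat.columns.name = None
print(flat)
```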
answered 15 hours ago by SEpapoulis
You can also use:
dfs = [df1, df2, df3]
# Outer-merge the first two frames on the shared key columns.
df = pd.merge(dfs[0], dfs[1], on=['depth', 'profile'], how='outer')
# Fold each remaining frame into the running result.
for d in dfs[2:]:
    df = pd.merge(df, d, on=['depth', 'profile'], how='outer')
   depth       VAR1    profile     VAR2    VAR3
0    0.5  38.196202  profile_1      NaN     NaN
1    0.6  38.198002  profile_1  0.20440     NaN
2    1.3  38.200001  profile_1      NaN  15.182
3    1.1        NaN  profile_1  0.20442     NaN
4    1.2        NaN  profile_1  0.20446  15.188
5    1.4        NaN  profile_1      NaN  15.182
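With more than ten frames, the loop above can be folded into a single expression with functools.reduce. This is only a compact sketch of the same outer-merge idea; the explicit sort at the end is my addition, since the row order of an outer merge can differ between pandas versions, and the name merged is illustrative.

```python
from functools import reduce

import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

dfs = [df1, df2, df3]

# Outer-merge the frames pairwise, left to right, on the shared keys.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=['depth', 'profile'], how='outer'),
    dfs,
)

# Sort explicitly so the result does not depend on merge ordering.
merged = merged.sort_values(['profile', 'depth']).reset_index(drop=True)
print(merged)
```

The merge matches depths by exact float equality, so depths that differ only by rounding noise would end up in separate rows.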
answered 15 hours ago by heena bawa