Issue
(this is my first question, if i need to improve anything about it, pls let me know!)
I am analysing a large observational dataset. start and stop time of each observation have been indicated so that i was able to calculate the duration. But there is a note column which includes information on "pauses" / "breaks" or "out of sight" periods in which the animal was not seen. I would like to subtract those time periods from total duration.
My problem is, one column includes several notes, not only pauses ("HH:MM-HH:MM") but also info on certain events (xy happened "@HH:MM").
I only want to look at time periods in the format of HH:MM-HH:MM and i want to exclude all event times labeled "@HH:MM". I've managed to drop all words and be left with only numbers, so it looks like this
id <- c("3990", "3989", "3004")
timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,", "@6:19,,7:25-7:43,@7:53", "7:30-7:39,7:45-7:48,7:49-7:54")
df <- data.frame(id, timepoints)
tried several ways of grep or gsub trying to indicate, either which to keep, or which to leave out but i failed. The closest I got was r dropping "@HH" but keeping ":MM". for this I used
gsub("@([[:digit:]]|[_])*", "", df$timepoints)
, as found for a similar problem just with words here: remove all words that start with "@" from a string
The aim is to get (e.g.):
id | timepoints |
---|---|
3990 | "7:16-7:23, 7:25-7:43" |
or
id | timepoints |
---|---|
3990 | "7:16-7:23", "7:25-7:43" |
If possible separated by comma, or directly separated into different columns so i can extract the time and subtract it from my total observation time.
Any help would be greatly appreciated!
Solution
How about matching the strings you're interested in instead?
With base
:
df$new_timepoints <- regmatches(df$timepoints, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", df$timepoints))
Output (with a list column):
id timepoints new_timepoints
1 3990 @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:16-7:23, 7:25-7:43
2 3989 @6:19,,7:25-7:43,@7:53 7:25-7:43
3 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39, 7:45-7:48, 7:49-7:54
With tidyverse
(in a long format for easy calculations!):
library(stringr)
library(dplyr)
library(tidyr)
df |>
group_by(id) |>
mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
unnest_longer(new_timepoints) |>
ungroup()
Output:
# A tibble: 6 × 3
id timepoints new_timepoints
<chr> <chr> <chr>
1 3990 @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:16-7:23
2 3990 @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:25-7:43
3 3989 @6:19,,7:25-7:43,@7:53 7:25-7:43
4 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39
5 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:45-7:48
6 3004 7:30-7:39,7:45-7:48,7:49-7:54 7:49-7:54
Answered By - harre Answer Checked By - Clifford M. (WPSolving Volunteer)