Tuesday, October 11, 2022

[SOLVED] How do I gsub the complete time string behind @

October 11, 2022 grep, gsub, r, regex, time

Issue

(this is my first question, if i need to improve anything about it, pls let me know!)

I am analysing a large observational dataset. start and stop time of each observation have been indicated so that i was able to calculate the duration. But there is a note column which includes information on "pauses" / "breaks" or "out of sight" periods in which the animal was not seen. I would like to subtract those time periods from total duration.

My problem is, one column includes several notes, not only pauses ("HH:MM-HH:MM") but also info on certain events (xy happened "@HH:MM").

I only want to look at time periods in the format of HH:MM-HH:MM and i want to exclude all event times labeled "@HH:MM". I've managed to drop all words and be left with only numbers, so it looks like this

id <- c("3990", "3989", "3004")

timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,", "@6:19,,7:25-7:43,@7:53", "7:30-7:39,7:45-7:48,7:49-7:54")

df <- data.frame(id, timepoints)

tried several ways of grep or gsub trying to indicate, either which to keep, or which to leave out but i failed. The closest I got was r dropping "@HH" but keeping ":MM". for this I used

gsub("@([[:digit:]]|[_])*", "", df$timepoints)

, as found for a similar problem just with words here: remove all words that start with "@" from a string

The aim is to get (e.g.):

id	timepoints
3990	"7:16-7:23, 7:25-7:43"

id	timepoints
3990	"7:16-7:23", "7:25-7:43"

If possible separated by comma, or directly separated into different columns so i can extract the time and subtract it from my total observation time.

Any help would be greatly appreciated!

Solution

How about matching the strings you're interested in instead?

With base:

df$new_timepoints <- regmatches(df$timepoints, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", df$timepoints))

Output (with a list column):

    id                        timepoints                  new_timepoints
1 3990 @6:19,,7:16-7:23,7:25-7:43,@7:53,            7:16-7:23, 7:25-7:43
2 3989            @6:19,,7:25-7:43,@7:53                       7:25-7:43
3 3004     7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39, 7:45-7:48, 7:49-7:54

With tidyverse (in a long format for easy calculations!):

library(stringr)
library(dplyr)
library(tidyr)

df |>
  group_by(id) |>
  mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
  unnest_longer(new_timepoints) |>
  ungroup()

Output:

# A tibble: 6 × 3
  id    timepoints                        new_timepoints
  <chr> <chr>                             <chr>         
1 3990  @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:16-7:23     
2 3990  @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:25-7:43     
3 3989  @6:19,,7:25-7:43,@7:53            7:25-7:43     
4 3004  7:30-7:39,7:45-7:48,7:49-7:54     7:30-7:39     
5 3004  7:30-7:39,7:45-7:48,7:49-7:54     7:45-7:48     
6 3004  7:30-7:39,7:45-7:48,7:49-7:54     7:49-7:54

Answered By - harre

Answer Checked By - Clifford M. (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, October 11, 2022

[SOLVED] How do I gsub the complete time string behind @

Issue

Solution

Popular Posts

Labels