These functions will be added to data_processing.R once approved. The scripts are modifications of @epbrenner's hmmering and rhmmer.
Removed redundant line reading from file as it was already handled earlier in the code.
jananiravi
left a comment
It looks like you moved a line -- unless I'm missing something, it's good to merge!
jananiravi
left a comment
I see that I commented on one commit earlier -- sorry about that!
In principle, it looks good. I would like to request @eboyer221 or @epbrenner to run this locally and suggest non-alpine placeholders to ensure this works for all!
| combined_drug_data <- unlist(batch_drug_data, use.names = FALSE) | ||
| if (length(combined_drug_data) == 0) { message("No drug data returned."); return(NULL) } | ||
| if (length(combined_drug_data) == 0) { | ||
| message("No drug data returned.") |
| combined_genome_data <- unlist(batch_genome_data, use.names = FALSE) | ||
| if (length(combined_genome_data) == 0) { message("No genome data returned."); return(NULL) } | ||
| if (length(combined_genome_data) == 0) { | ||
| message("No genome data returned.") |
Should this message say "found", "returned", or "retrieved"? Same question as before.
| chunk_size <- ceiling(length(records) / chunk_count) | ||
| chunks <- split(records, ceiling(seq_along(records) / chunk_size)) | ||
| purrr::walk2(chunks, seq_along(chunks), function(chunk, i) { |
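For anyone running this locally, here is a minimal sketch of how the chunking above behaves. The `records` vector and `chunk_count` value are made-up placeholders, and base R `lengths()` stands in for the `purrr::walk2()` loop so the snippet needs no packages.

```r
# Hypothetical input: 10 record IDs, split into at most 3 chunks.
records <- sprintf("rec%02d", 1:10)
chunk_count <- 3

# Same two lines as in the diff: fixed chunk size, then group by index.
chunk_size <- ceiling(length(records) / chunk_count)
chunks <- split(records, ceiling(seq_along(records) / chunk_size))

# Inspect how many records landed in each chunk.
chunk_lengths <- lengths(chunks)
print(chunk_lengths)
```

Because the chunk size is rounded up, the number of chunks actually produced can be smaller than `chunk_count` (for example, 9 records with `chunk_count = 4` gives 3 chunks of 3), so downstream code should iterate over `seq_along(chunks)` rather than `seq_len(chunk_count)`.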
| "exec", | ||
| "-B", paste0(mount_host, ":", mount_cont), | ||
| "-B", paste0(db_host_dir, ":", db_cont_dir), | ||
| "/scratch/alpine/aghosh5@xsede.org/software/hmmer_latest.sif", |
| message("Combined parquet written") | ||
| # arrow::read_parquet("/scratch/alpine/aghosh5@xsede.org/AMR/data/Campylobacter_jejuni/protein_COG_count.parquet") |> DBI::dbWriteTable(conn=con, name="protein_COG_count") |
Hardcoded path alert: this cannot be part of the public amRdata repo.
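One possible way to avoid the hardcoded scratch paths (this one and the `.sif` path above) is to resolve the base directory from an argument or an environment variable. This is only a sketch: the helper name `.resolveDataDir` and the `AMRDATA_DIR` variable are hypothetical and do not exist in the package today.

```r
# Hypothetical helper: take the data directory from an argument, falling back
# to an assumed environment variable (AMRDATA_DIR is an illustrative name).
.resolveDataDir <- function(data_dir = Sys.getenv("AMRDATA_DIR", unset = NA)) {
  if (is.na(data_dir) || !nzchar(data_dir)) {
    stop("Set `data_dir` or the AMRDATA_DIR environment variable.", call. = FALSE)
  }
  normalizePath(data_dir, mustWork = FALSE)
}

# Usage: build the parquet path relative to the resolved directory.
# tempdir() is used here only so the example runs anywhere.
parquet_path <- file.path(
  .resolveDataDir(tempdir()),
  "Campylobacter_jejuni",
  "protein_COG_count.parquet"
)
```

With something like this, the same code can run on Alpine or on a laptop without editing the source.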
| cdhit_extra_args = c("-g", "1"), | ||
| cdhit_output_prefix = "cdhit_out", | ||
| # InterPro | ||
| ipr_appl = c("Pfam"), |
Can the user switch between Pfam and another application here? @AbhirupaGhosh @epbrenner
| .runHMMER <- function(duckdb_path, | ||
| output_path, | ||
| threads = 0, |
| threads = 0, | |
| threads = 1, | |
| n_workers = 1, |
| # number of parallel jobs (NOT threads per hmmscan) | ||
| n_workers <- 4 | ||
| # threads per hmmscan | ||
| threads <- 8 |
| # number of parallel jobs (NOT threads per hmmscan) | |
| n_workers <- 4 | |
| # threads per hmmscan | |
| threads <- 8 |
Just had a thought: do we have to run each HMMER database separately in the function and then combine the outputs later?
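If the answer is yes, a rough sketch of the per-database pattern might look like the following. Everything here is illustrative: `run_one_db` is a hypothetical stand-in for the real hmmscan call, and the database names are placeholders.

```r
# Hypothetical stand-in for one hmmscan run against a single database;
# it only returns a one-row data frame so the pattern can be demonstrated.
run_one_db <- function(db_name, threads) {
  data.frame(db = db_name, threads = threads, stringsAsFactors = FALSE)
}

n_workers <- 4  # number of parallel jobs (NOT threads per hmmscan)
threads <- 8    # threads per hmmscan

# Total cores requested if all workers run at once.
total_cores <- n_workers * threads

# Run each database separately, then combine the per-database outputs.
dbs <- c("db_A", "db_B")  # placeholder database names
per_db <- lapply(dbs, run_one_db, threads = threads)
combined <- do.call(rbind, per_db)
```

Keeping `n_workers` and `threads` as separate arguments makes the total core request (`n_workers * threads`) explicit, which matters when sizing a cluster job.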