Zip metadata files #390

Closed
GWMcElfresh wants to merge 1 commit into BimberLab:discvr-26.3 from GWMcElfresh:discvr-26.3

@GWMcElfresh

Rationale

Currently, when saving the metadata to disk, the files aren't gzipped, although I believe this is the intention given the file name here:

metaFile <- paste0(outputPrefix, '.', datasetIdForFile, '.seurat.meta.txt.gz')

Related Pull Requests

Nothing directly related, but entertainingly, this is likely why the metadata files are so large in bimberlabinternal/Rdiscvr#68.

Changes

Single-line change to the write.table() call below: wrap the output path in gzfile() so the file is actually gzipped. write.table() should also clean up the connection automatically.

write.table(metaDf, file = metaFile, quote = T, row.names = F, sep = ',', col.names = T)
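
As a minimal sketch of the change described above (not the exact diff; `metaDf` and `metaFile` here are toy stand-ins for the pipeline's objects), wrapping the path in gzfile() makes write.table() emit compressed output. The check at the end assumes only that gzip streams start with the magic bytes 0x1f 0x8b.

```r
# Toy stand-ins for the pipeline's objects (hypothetical values).
metaDf <- data.frame(cellbarcode = c("AAAC", "AAAG"), nCount_RNA = c(1200, 950))
metaFile <- tempfile(fileext = ".seurat.meta.txt.gz")

# Passing an unopened gzfile() connection makes write.table() compress the output.
write.table(metaDf, file = gzfile(metaFile), quote = TRUE, row.names = FALSE,
            sep = ",", col.names = TRUE)

# gzip streams begin with the magic bytes 0x1f 0x8b.
magic <- readBin(metaFile, what = "raw", n = 2)
isGzipped <- identical(magic, as.raw(c(0x1f, 0x8b)))

# Round trip: read.table() decompresses gzipped files transparently.
roundTrip <- read.table(metaFile, header = TRUE, sep = ",")
```

If the `file` argument were left as the plain path, the output would be uncompressed text regardless of the `.gz` suffix, which matches the behaviour reported in this PR.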

@bbimber
Contributor

bbimber commented Apr 10, 2026

@bbimber bbimber closed this Apr 10, 2026
@bbimber
Contributor

bbimber commented Apr 10, 2026

It's sorta moot here, but are you sure we don't need to explicitly close the file connection? Google suggests we do:

# 1. Create and open the connection
con <- gzfile("data.csv.gz", "w")

# 2. Write data to the connection
write.csv(mtcars, con)

# 3. Explicitly close the connection
close(con)
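
A slightly more defensive variant of the explicit-close pattern above, offered only as a sketch: registering close() with on.exit() means the connection is released even if the write fails partway. The helper name `writeGzippedCsv` is hypothetical and not part of the PR.

```r
# Hypothetical helper: write a data frame as gzipped CSV and always close the connection.
writeGzippedCsv <- function(df, path) {
  con <- gzfile(path, open = "w")
  # on.exit() runs even if write.table() throws, so the connection cannot leak.
  on.exit(close(con), add = TRUE)
  write.table(df, file = con, quote = TRUE, row.names = FALSE, sep = ",", col.names = TRUE)
  invisible(path)
}

outFile <- tempfile(fileext = ".csv.gz")
writeGzippedCsv(mtcars, outFile)
```

Because the connection is already open when it reaches write.table(), write.table() leaves it open, and closing it is the caller's job.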

@GWMcElfresh
Author

My understanding is that because we're not explicitly opening the connection with conn <- gzfile(...), and instead call it on the fly during the write.table() call, R is smart about it and closes it automatically. Otherwise yeah, we'd need to close it.

@bbimber
Contributor

bbimber commented Apr 10, 2026

> My understanding is that because we're not explicitly opening the connection with conn <- gzfile(...), and instead call it on the fly during the write.table() call, R is smart about it and closes it automatically. Otherwise yeah, we'd need to close it.

OK, I don't really know. It's possible you beat Gemini on this one.

@GWMcElfresh
Author

> @GWMcElfresh, doesn't it already gzip them? Not here, but downstream?
>
> https://prime-seq.ohsu.edu/Labs/Bimber/1947/pipeline-browse.view?returnUrl=%2FLabs%2FBimber%2F1947%2Fpipeline-status-details.view%3FrowId%3D615309&path=sequenceOutputPipeline%2FSequenceOutput_2026-04-09_16-08-50

I don't think it gzips them - it's just named .gz. Or if it does gzip them at some point, the files are at least not gzipped when they're written.

See this:

$ du -sh SingleCell.CustomUCell.EC_V4_Spleen_Myeloid.seurat.meta.txt.gz
404M    SingleCell.CustomUCell.EC_V4_Spleen_Myeloid.seurat.meta.txt.gz

$ mv SingleCell.CustomUCell.EC_V4_Spleen_Myeloid.seurat.meta.txt.gz file_without_gz_suffix.txt
$ gzip file_without_gz_suffix.txt

$ du -sh file_without_gz_suffix.txt.gz
115M    file_without_gz_suffix.txt.gz

Further, nano on the original file (nano rather than vim etc., which gunzip on the fly):
[image]

nano on the gzipped file:
[image]

@bbimber
Contributor

bbimber commented Apr 10, 2026

OK, you are correct. I will fix that; however, I want to also make something to retroactively gzip existing files.

I googled the write.table(gzfile()) pattern, and I cannot find anything that says this will automatically handle closing the connection. Do you see something different?
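
One possible shape for the retroactive gzip mentioned above, as a rough sketch only: detect files that carry a `.gz` name but lack the gzip magic bytes (0x1f 0x8b), then compress them in place. The helper names `isGzipped` and `gzipInPlace` and the directory in the usage comment are hypothetical, and this version reads each file fully into memory, which may not suit very large files.

```r
# Returns TRUE if the file starts with the gzip magic bytes 0x1f 0x8b.
isGzipped <- function(path) {
  identical(readBin(path, what = "raw", n = 2), as.raw(c(0x1f, 0x8b)))
}

# Hypothetical helper: compress in place any file that is not already gzipped.
# Returns TRUE if the file was compressed, FALSE if it was left untouched.
gzipInPlace <- function(path) {
  if (isGzipped(path)) return(invisible(FALSE))
  bytes <- readBin(path, what = "raw", n = file.size(path))
  tmp <- paste0(path, ".tmp")
  con <- gzfile(tmp, open = "wb")
  # Close the connection before renaming so all compressed bytes are flushed.
  tryCatch(writeBin(bytes, con), finally = close(con))
  file.rename(tmp, path)
  invisible(TRUE)
}

# Usage on a hypothetical output directory:
# files <- list.files("path/to/outputs", pattern = "\\.seurat\\.meta\\.txt\\.gz$",
#                     full.names = TRUE, recursive = TRUE)
# for (f in files) gzipInPlace(f)
```

Checking the magic bytes first keeps the helper safe to re-run, since files that are already compressed are skipped rather than double-gzipped.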

@bbimber bbimber reopened this Apr 10, 2026
@GWMcElfresh
Author

This is apparently complicated (i.e. it depends on the function used to write). I'll test in the morning by diffing showConnections() output before and after the write.

Worst case, we explicitly define the connection and close() it.
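
The showConnections() comparison proposed above could look roughly like this. It is a sketch using the built-in `mtcars` data rather than the real metadata, and it counts user connections before and after the write to see whether the on-the-fly gzfile() connection is left behind.

```r
# Number of user connections registered before the write.
before <- nrow(showConnections(all = FALSE))

tmp <- tempfile(fileext = ".txt.gz")
# gzfile() is created on the fly and handed to write.table() unopened.
write.table(mtcars, file = gzfile(tmp), sep = ",")

# Number of user connections registered after the write.
after <- nrow(showConnections(all = FALSE))

# A positive value would mean the connection was left behind.
leaked <- after - before
print(leaked)
```

As far as I know, write.table() opens a connection that arrives unopened and registers close() via on.exit(), so no leak is expected here. Other writer functions may behave differently, which is why the check is worth running per function.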

@bbimber
Contributor

bbimber commented Apr 10, 2026

It's certainly worth knowing if there's something better (seems like switching from base R to readr or something would do that); however, this is a base R solution that should work:

793037a

@bbimber bbimber closed this Apr 10, 2026
@GWMcElfresh
Author

Thanks! I'll also see if I can reproduce that linked Rdiscvr issue once the retroactive gzip happens. I think it was probably auto-detecting the file as text, and the httr parsing issue probably doesn't happen on non-text files.
