
Managing large image collections


Thread replies: 47
Thread images: 6

File: 1486491373421.jpg (350KB, 1200x1200px)
Hello,

As an imageboard user and data hoarder, I save all the content I find interesting, amusing, or arousing. You know this very well since you are probably the same way.

I have several thousand images sitting on my hard drive, most of which were saved by hand. This is content I dedicated time to finding, so it is pretty special. I have several thousand more images, but those were scraped, so I have no trouble organizing them. I have postponed sorting the hand-saved images for a while, and hand-sorting them is just painful and time-consuming. At the moment, I have about 5000 images left to organise into around 200 thematic folders. I thought about deploying simple machine-learning-based solutions using Python and TensorFlow, but they are unlikely to fit the level of granularity I go for, since vastly different images can belong in the same folder (think of memes, for example). Also, labelling a training set by hand would amount to the same hand-sorting I am trying to get done right now.

Do you have a similar collection? Do you have a set-up to manage it?
>>
Some anon told me about this: https://github.com/hydrusnetwork/hydrus
>>
>>60860525
>https://github.com/hydrusnetwork/hydrus

This is incredible and very much valuable, thank you anon.
>>
>>60860525
Interesting. Thanks.
>>
>>60860486
Seconding Hydrus. Just use that.

Note that it will maintain its own file and folder structure (all hash-based).

It doesn't preserve YOUR folders and filenames, though you can grab them into the DB at import time.
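
A minimal sketch of what a hash-based file structure like the one described above can look like. The layout here (`<store>/<first two hex chars>/<hash><extension>`), the use of SHA-256, and the function names are assumptions for illustration, not Hydrus's actual on-disk format.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_file(source: Path, store_root: Path) -> Path:
    """Copy `source` into a hash-named location under `store_root`.

    Hypothetical layout: <store_root>/<first two hex chars>/<hash><ext>.
    The original filename and folder are not kept anywhere in this sketch.
    """
    file_hash = sha256_of_file(source)
    target_dir = store_root / file_hash[:2]
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (file_hash + source.suffix.lower())
    if not target.exists():  # identical content is stored only once
        shutil.copy2(source, target)
    return target
```

Because the path is derived from the content, importing the same image twice under different names lands on the same file, which is where the easy duplicate elimination comes from.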
>>
>>60860586
PS: To get other people's tags, which is one of the main uses of this tool, adding the PTR (the public tag repository by and for hydrus users) is the most important step:
https://hydrusnetwork.github.io/hydrus/help/getting_started_tags.html

You can also add downloaded tag database dump files (like from a *booru) in almost the same place, but the PTR is the big one.
>>
hydrus network is so freaking gay

I keep a complex file structure of anime images sorted by character, pairings, show etc. etc.
and then Elo ranked using tournaments and manual comparisons and some fancy algorithms
and then rip tags from boorus using a program i wrote and embed the tags as metadata

coming soon: ranking individual tags using Microsoft TrueSkill and a gui to put it all together and btfo hydrus once and for all

btw hydrus copies all of your images into its own folder, freaking LAME
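
For reference, a sketch of the standard Elo update that a pairwise image ranking like the one described above could be built on. The poster's actual algorithm is not shown in the thread; the K-factor of 32 and the function names are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return the new ratings after one comparison.

    score_a is 1.0 if image A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Usage: two unrated images start equal, the first one wins the comparison.
ratings = {"img_a.png": 1500.0, "img_b.png": 1500.0}
ratings["img_a.png"], ratings["img_b.png"] = update_elo(
    ratings["img_a.png"], ratings["img_b.png"], score_a=1.0
)
```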
>>
>>60861836
You have autism
>>
last time I used hydrus, it lagged horribly with large image collections. was this ever fixed?
>>
>>60861836
>btw hydrus copies all of your images into its own folder, freaking LAME
Wait, it does? That's not gonna work out when I have around 40gb of images saved.
>>
>>60861853
anyone who considers using hydrus network is reddit and also im not autistic
>>
>>60861899
Hydrus is shit, yes, but you have autism.

>>60861875
Yes, the creator somehow thought it a bad idea to just symlink and create a db of hashes based on that.
Find a download for Google Picasa instead.
>>
>>60861864
It doesn't really lag for me, but of course querying the DB and loading thumbnails gets slower if you have several million files and search for a bunch of tags.
>>
>>60861875
It can move them into its own folder. Doesn't require any more storage space then.

But it will manage files and folders hash-based, yes. It doesn't do another mapping between hash and file storage location.
>>
Side question ... is there a self-hosted booru that clones the tags from the public ones?
>>
>>60861921
> Yes, the creator somehow thought it a bad idea to just symlink and create a db of hashes based on that.
It is a problem: symlinks and NTFS junctions, like similar mechanisms, aren't going to work on all filesystems and under all security policies that are still in use. Symlinks are generally used sparingly, and mostly by people who actually do sysadmin work.

Also see:
> why not use filenames and folders?
> can the client manage files from their original locations?
https://hydrusnetwork.github.io/hydrus/help/faq.html

It's a choice for more performance and fewer file management problems.
>>
>>60863991
Yes, that's exactly what Hydrus is doing.

Apart from the public tag repository you can use tag database dumps scraped from all the typical boorus and such.

And it CAN host a local web-browser based booru (https://hydrusnetwork.github.io/hydrus/help/local_booru.html) apart from the python GUI client.
>>
>>60863999
if the dev wasn't a freaking retard he would put the metadata inside the images and kept a db of hashes
>>
>>60864019
It's freaking retarded to alter all the images, it will change their hashes.

And actually, not all supported formats even have a metadata field you can write to.

And yes, this IS using a db of hashes, it's just not also doing a mapping from hashes to file paths that needs an extra lookup per file.
>>
>>60864014
Neat. Gonna have a booru with only porn that I like.

Can it auto-download artist tags too?
>>
>>60864035
save the new hash as metadata in the image genius
bwahaha
>>
>>60864035
>>60864019

>Use tool once
>Changes hash of all your images
>Now locked in to that vendor
>>
>>60864048
>new hash
>PTR becomes pointless since the hashes are different
>>
>>60864045
If you let Hydrus download files from a *booru, it'll basically grab all tags if you tell it to. [There is sometimes slightly more fine-grained control, but generally you'll grab all tags.]

The PTR and DB dumps from other boorus can then add their tags on top of that.
>>
File: ThreeDimensional 504.jpg (95KB, 770x1024px)
>>60864064
>>60864057
>>60864048
>>60864035
>>60864019
a large metadata section that just iterates looking for hash collisions for the original image hash
>>
>>60864057
Yea, and everyone else will hate your users when they upload their files from their filesystem directly since now easy duplicate elimination no longer works.

Basically, even if you want to hack the feature into your fork of hydrus, please add another DB to have that slower hash->file paths indirection. Don't fuck with the images if you don't have to.
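
A rough sketch of the "another DB" approach mentioned above: files stay where they are and are never modified, and a separate index maps hashes to paths. The table name, column names and function names are made up for illustration, and this is not how Hydrus itself stores anything.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small hash -> path index."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY,"
        " sha256 TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sha256 ON files (sha256)")
    return conn


def index_folder(conn: sqlite3.Connection, folder: Path) -> int:
    """Hash every file under `folder` in place; return how many were indexed."""
    count = 0
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; fine for images, not for huge files.
            file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
            conn.execute(
                "INSERT OR REPLACE INTO files (path, sha256) VALUES (?, ?)",
                (str(path), file_hash),
            )
            count += 1
    conn.commit()
    return count


def paths_for_hash(conn: sqlite3.Connection, file_hash: str) -> list[str]:
    """The extra lookup: resolve a hash to every path holding that content."""
    rows = conn.execute(
        "SELECT path FROM files WHERE sha256 = ? ORDER BY path", (file_hash,)
    )
    return [row[0] for row in rows]
```

The trade-off is the one argued about in the thread: every access goes through a lookup, and the index goes stale whenever files are moved or renamed outside the tool.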
>>
>>60864103
ohhh noo the ceo of whatever non-encrypted / autistic indie image host will be mad because I used .0000001% more resources
>>
Not entirely sure of the file structure for all image types, but isn't it possible to generate a hash using only data from the non-metadata section of the file? So that you can rename the file, change metadata, tags, etc. but the hash remains the same so long as the image remains the same?
>>
>>60864130
the hash will be calculated using the entire file by whatever other generic service that sees the file; I think it's technically possible to generate a deterministic hash without the metadata though; because normal jpeg metadata is all at the end of the file by itself
>>
>>60864157
Right, so really... that's how it should be done. I'm sure such a system would have to detect the way the drive is formatted, and the file type, to properly hash using only "non-metadata".
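
A sketch of the "hash only the non-metadata part" idea for JPEG specifically. In JPEG files, metadata such as EXIF and comments usually sits in APPn and COM segments near the start of the file, ahead of the compressed image data, so the sketch walks the segments and leaves those out of the hash. The function name is hypothetical, and the parser is deliberately simplified (no fill bytes, no multi-image files). Note that some APPn segments (JFIF, Adobe) can affect how the image decodes, so this is not a strict "same pixels" guarantee.

```python
import hashlib
import struct

METADATA_MARKERS = set(range(0xE0, 0xF0)) | {0xFE}  # APP0..APP15 and COM


def jpeg_content_hash(data: bytes) -> str:
    """SHA-256 over a JPEG with its APPn/COM segments left out."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    digest = hashlib.sha256()
    digest.update(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # SOS: from here on it is compressed image data, hash it all.
            digest.update(data[pos:])
            return digest.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        segment_end = pos + 2 + length
        if marker not in METADATA_MARKERS:
            digest.update(data[pos:segment_end])
        pos = segment_end
    raise ValueError("no SOS marker found")
```

Two copies of a file that differ only in EXIF or comment segments get the same content hash here while their whole-file hashes differ, which is exactly why this hash would not match what a booru or the PTR computes over the entire file.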
>>
>>60864123
If they can't strip the metadata easily, they'll just disable the file uploads, and either way, you'd be a faggot like the sites who watermark everything.

Even then, you can't add your metadata to all file types that Hydrus supports. But feel free to fork if you really need to try your approach, it's an open sauce project after all.
>>
>>60864157
Enjoy doing it for all 16+ container formats apart from JPEG.

And you're generally just asking to wreck performance anyhow.
>>
File: .png (405KB, 358x500px)
>>60864205
where do these fantasies about 'disabling uploading of unique files' come from lol

im not forking garbage pythongui garbage shit; i already have my own perfect system built

>>60864241
lol calculate the hash before modifying the file and save that hash as metadata nerd
all these problems are already solved
there are already cpp libraries that can write metadata to just about any image format that supports metadata; otherwise just convert the image

every booru api supports searching for hash directly anyway, you dont even have to do anything funny
>>
the best way to sort this kind of content is chronologically, like how your memories are organised. separate them into folders ordered by month. this way there should be a manageable amount of folders each with a manageable amount of images and if you have a decent memory you should be able to recall roughly what's in each folder
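
A small sketch of the month-folder scheme described above, assuming the file's modification time is a good enough stand-in for when it was saved. The `YYYY-MM` folder naming and the function names are assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path


def month_folder_name(timestamp: float) -> str:
    """Turn a POSIX timestamp into a folder name like '2017-06' (local time)."""
    return datetime.fromtimestamp(timestamp).strftime("%Y-%m")


def sort_into_months(source: Path, destination: Path) -> dict[str, int]:
    """Move every file directly inside `source` into destination/YYYY-MM/.

    Returns how many files went into each month folder. A file already
    present under the same name in the target folder may be overwritten.
    """
    moved: dict[str, int] = {}
    for path in sorted(source.iterdir()):
        if not path.is_file():
            continue
        month = month_folder_name(path.stat().st_mtime)
        target_dir = destination / month
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved[month] = moved.get(month, 0) + 1
    return moved
```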
>>
>>60864357
worst post itt
>>
>>60864403
t. someone with a shit memory.
>>
>>60864274
> i already have my own perfect system built
Uh ... good job I guess?

I obviously don't even see why I should believe that it's anywhere near perfect. You generally seem to make everything slower by requiring a lot of filesystem accesses, and be interested in JPEG only.

> every booru api supports searching for hash directly anyway, you dont even have to do anything funny
No shit, because *boorus and various CDN and big data things tend to store and retrieve files by hash, e.g. "/68/c4/sample_68c416bf307b595173121aad55d829fd.jpg" on gelbooru. Exactly because it's the superior solution.

> otherwise just convert the image
Converting one lossy image format into another is usually such a great idea.

Never mind all the fun you can have converting .swf, .pdf and more.
>>
>>60864357
You don't even need the folders, your file manager can sort files chronologically for you.

But really, this doesn't work very well unless you're not looking at that many files, or have a really fucking good memory.
>>
>>60864502
lol why are my anime images going to .pdfs lmao
>>
So does hydrous in effect know how to parse through 4chan images simply based off the file number and no additional tags? I say this because I have over a hundred twenty thousand images save in a single four terabyte SATA drive that are completely disorganized outside of being listed by the sequential order as saved from
>>
>>60864563
> So does hydrous in effect know how to parse through 4chan images simply based off the file number and no additional tags?
It will generate checksums from the files when you import the files.

Then it will match them with tag databases you enabled (from *boorus, the PTR, whatever) and you should have quite a lot of images that now have tags.

It'll also generate a second set of "checksums" (perceptual hashes) to enable finding duplicate files for various supported file types, but they'll not be used in the file names.

> over a hundred twenty thousand images
Seems considerably less than I'd typically expect on a 4TB drive, are they all high resolution or something?
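
To illustrate the two kinds of "checksum" from the post above: an exact hash is used as a key into a tag database, while a perceptual hash is compared by bit distance to find near-duplicates. The sketch below uses a plain average hash, which is a simple stand-in and not necessarily the algorithm Hydrus uses. It assumes the image has already been decoded and shrunk to a small grayscale grid (that step needs an imaging library and is left out). `tag_db` is a hypothetical in-memory mapping.

```python
import hashlib


def tags_for_file(data: bytes, tag_db: dict[str, set[str]]) -> set[str]:
    """Exact match: look the file's SHA-256 up in a hash -> tags mapping."""
    return tag_db.get(hashlib.sha256(data).hexdigest(), set())


def average_hash(pixels: list[list[int]]) -> int:
    """Average hash of a small grayscale grid (values 0..255).

    Each pixel contributes one bit: 1 if it is brighter than the mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(a ^ b).count("1")
```

A re-encoded or slightly brightened copy changes the exact hash completely, so it gets no tags from the database, but its average hash stays close to the original's, which is what makes it usable for duplicate detection.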
>>
>>60864617
That's just really my memes and interesting images folder from 4chan the SATA drive just so happens to be 4 terabytes in size.

So the images have to be p*** is that really the caveat in order to make it work? What about all of the the images I have that are strictly not p***and obviously wouldn't show up on you know like most archives with extensive tags. Also I'm using a transcriber have no f****** idea why the phone keeps censoring my language when I curse but I don't think it well sensor specific terms like gook.
>>
>>60864672
lol
>>
>>60864672
> That's just really my memes and interesting images folder from 4chan the SATA drive just so happens to be 4 terabytes in size.
Odd. Most of these should be like 200kb to 1MB or something, so I'd certainly more expect like 120GB than 4TB.

I'm not actually sure what that p*** censored word is. Pasta? Porn? Penises? But anyhow, it's pretty funny what exactly your transcriber censors.

I also can't really tell you if there is particularly *good* tag coverage for your images. But of course you can add your own tags to files. [Whether it's on a booru or on the PTR, most tags were ultimately added manually by someone.]
>>
>>60861864
It is limited by your hard drive of course, but you can try running db maintenance or (preferably) moving hydrus to your ssd (you do have an ssd, right anon?)
>>
I tell you what I do OP. every few months I perform what I call a "clean slate" where I move all pictures from my phone, tablet, and laptop onto dedicated flash drives. the drives then go into a box along with graveyard of drives from previous clean slates.

You would think that you would need to explore these often, but often I find they are forgotten about quite quickly as your fresh devices begin the cycle again with new media just as rapidly
>>
>>60860486
I use Save Image/Link in Folder and save images into set categories. The images go into an unsorted folder within their category so that I can place them into their specific folders later.


This is a 4chan archive - all of the content originated from that site.
This means that RandomArchive shows their content, archived.