qwebirc57: unstable fucking piece of shit
***: dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
qwebirc57 is now known as dd0a13f37
Honno has quit IRC (Read error: Operation timed out)
dd0a13f37: Did I miss anything?
***: dd0a13f3T has joined #archiveteam-bs
dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
JAA: Nah
***: refeed has joined #archiveteam-bs
dd0a13f3T is now known as dd0a13f37
drumstick has quit IRC (Read error: Operation timed out)
dd0a13f37: If something is on Usenet, is it considered archived? And would it be a good idea to upload Library Genesis torrents to archive.org, or would that be considered wasting space/bandwidth for piracy?
JAA: I've heard that there might be a copy of libgen at IA already (but not publicly available). Not sure if it's true though.
And although Usenet is safe-ish, I wouldn't consider it archived. Stuff still disappears from it sooner or later.
dd0a13f37: You can upload a torrent to IA and have them download it, right?
JAA: Yes, I believe so.
dd0a13f37: Then you could download their zip file of torrents, upload them to archive.org, then wait for them to pull it
But is it worth it? It's 30 TB of data, and it will likely be hidden
The databases are archived
https://archive.org/details/libgen-meta-20150824
***: dd0a13f3 has joined #archiveteam-bs
BlueMaxim has joined #archiveteam-bs
JAA: I wouldn't be surprised if either https://archive.org/details/librarygenesis or https://archive.org/details/gen-lib contained a full (hidden) archive.
***: dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
dd0a13f3: Should I avoid uploading it, or will it recognize and deduplicate?
***: dd0a13f3 is now known as dd0a13f37
dd0a13f37: both of these are 3 years old, so they're outdated at any rate
godane: so i'm going through my web archives that i have not uploaded
or at least thought i uploaded and turned out i didn't
dd0a13f37: Okay, so if I have a url pointing to a zip file of torrents, can I just give them the URL?
No, apparently not. How does this "derive" stuff work, can I have them unpack a zip file for me?
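(Editorial aside: uploads like this can be scripted with the `internetarchive` Python package, uploading the individual .torrent files directly rather than a zip; IA can then fetch the torrents' contents server-side. A rough, untested sketch — the identifier scheme, metadata values, and directory layout below are all made up for illustration:)

```python
from pathlib import Path

def build_metadata(subject: str) -> dict:
    """Metadata sent alongside each uploaded torrent (values are made up)."""
    return {
        "title": f"Hypothetical torrent mirror ({subject})",
        "mediatype": "data",
        "subject": subject,
    }

def upload_torrents(directory: str, subject: str) -> None:
    # Imported lazily so the helper above works without the package.
    # pip install internetarchive
    from internetarchive import upload
    for torrent in sorted(Path(directory).glob("*.torrent")):
        # One hypothetical item per torrent file; IA can then pull the
        # torrent's actual contents itself.
        upload(f"{subject}-{torrent.stem}",
               files=[str(torrent)],
               metadata=build_metadata(subject))

# upload_torrents("torrents/", "libgen")  # needs an IA account + `ia configure`
```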
JAA: dd0a13f37: That's when the collection was created, not when any items in the collection were added/last updated.
By the way, the graph for the number of items in the second collection of the two looks interesting...
dd0a13f37: Sure, but who would update such a collection?
JAA: Someone from IA?
dd0a13f37: 2k items is much too small, they have 2m books. Or is it the amount of folders?
JAA: An item can hold an arbitrary number of directories and files (more or less, there seem to be some issues if the items get very large).
If they have a copy, they certainly wouldn't throw it all into one item, and they also certainly wouldn't throw each book/article into its own item.
dd0a13f37: The torrents are folders named XXXX000, where XXXX is the unique identifier (from 0-2092)
JAA: Well, then 2k sounds about right?
dd0a13f37: So that could mean there are 2k different folders
Yeah
Although, looking at the graph it seems more like 1.4k, or is it log?
-: JAA shrugs
JAA: Looks like it might be rounded, so the top of the graph is 1.5k.
godane: i'm reuploading my images.g4tv.com dumps
dd0a13f37: Should I upload them again then?
They're also missing sci-mag, which is around 50 TB
JAA: Definitely ask IA about this first.
But I doubt that that dataset is going to disappear anytime soon.
There are certainly several copies stored in various places.
(Including the ones publicly available via Usenet or torrents.)
dd0a13f37: Yes, that's true. The torrents are seeded, and various mirrors have more or less complete copies.
godane: looks like i uploaded them, never mind
dd0a13f37: Sci-mag is worse off, but on the other hand they have sci-hub which has multiple servers run by people who are not subject to any jurisdiction
So both collections should be fine
***: drumstick has joined #archiveteam-bs
VADemon_ has quit IRC (left4dead)
hook54321: Should I check if a piece of software is already on archive.org before going through all my CDs?
dd0a13f37: To upload or to download?
If they're somehow part of a collection then it might not be such a huge deal
hook54321: What do you mean?
dd0a13f37: If you have some collection of software on 10 different disks that you bought as a bundle then it might have historical value as a whole even if all the software exists separately
hook54321: it's mostly single disks, bought separately.
dd0a13f37: Well, it can't be that much storage wasted even if you do upload it twice
could be different versions as well
hook54321: If it has a different cover then I would definitely upload it
***: drumstick has quit IRC (Read error: Operation timed out)
drumstick has joined #archiveteam-bs
hook54321: arkiver: I left the channel
***: Sk1d has quit IRC (Ping timeout: 194 seconds)
Sk1d has joined #archiveteam-bs
refeed has quit IRC (Ping timeout: 600 seconds)
pizzaiolo has quit IRC (Quit: pizzaiolo)
refeed has joined #archiveteam-bs
icedice has quit IRC (Quit: Leaving)
Dimtree has quit IRC (Read error: Operation timed out)
hook54321: Did we grab all the duckduckgo stuff?
***: Dimtree has joined #archiveteam-bs
Soni has quit IRC (Ping timeout: 272 seconds)
Stilett0 has joined #archiveteam-bs
DFJustin has quit IRC (Remote host closed the connection)
DFJustin has joined #archiveteam-bs
swebb sets mode: +o DFJustin
Asparagir has quit IRC (Asparagir)
kristian_ has joined #archiveteam-bs
Honno has joined #archiveteam-bs
kristian_ has quit IRC (Quit: Leaving)
schbirid has joined #archiveteam-bs
refeed has quit IRC (Read error: Operation timed out)
tuluu has quit IRC (Read error: Operation timed out)
underscor has joined #archiveteam-bs
swebb sets mode: +o underscor
tuluu has joined #archiveteam-bs
BartoCH has joined #archiveteam-bs
zhongfu_ has quit IRC (Ping timeout: 260 seconds)
zhongfu has joined #archiveteam-bs
Mateon1 has quit IRC (Read error: Operation timed out)
Mateon1 has joined #archiveteam-bs
noirscape has joined #archiveteam-bs
BlueMaxim has quit IRC (Quit: Leaving)
drumstick has quit IRC (Read error: Operation timed out)
joepie91_: hook54321: definitely upload it; if it turns out to be a duplicate it can always be removed later
hook54321: there are often many different editions of the same thing
***: Soni has joined #archiveteam-bs
pizzaiolo has joined #archiveteam-bs
tuluu_ has joined #archiveteam-bs
tuluu has quit IRC (Read error: Operation timed out)
dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
JAA: http://www.instructables.com/id/How-to-fix-a-Samsung-external-m3-hard-drive-in-und/ :-)
***: wp494 has quit IRC (Read error: Connection reset by peer)
wp494 has joined #archiveteam-bs
schbirid has quit IRC (Quit: Leaving)
etudier has joined #archiveteam-bs
Stilett0 has quit IRC (Read error: Operation timed out)
etudier has quit IRC (Remote host closed the connection)
second: They say archive.org did a faulty job of archiving something, but they have the new forums up, can you guys archive their backup? http://gamehacking.org/ Scroll down to news for Aug 10th
Or I can archive it but where do I upload it to get it into the archive and what is the proper way to do so?
JAA: second: Is GameHacking itself also in danger, or is this just about the WiiRd forum archive?
Whatever. GH isn't that big anyway. I'll throw it into ArchiveBot.
Scratch the "not that big", but it's worth archiving the entire thing. Looks like it has tons of useful resources.
***: mls has quit IRC (Read error: Connection reset by peer)
mls has joined #archiveteam-bs
second: JAA: just the WiiRd forum
JAA: you're going to have a hard time archiving the gamehacking parts though
Lots of javascript on the page, I was doing it but chrome headless crashed with the setup I was using in docker w/ warcproxy
I'll redo it when I get some time and hopefully when firefox headless comes out
I have a Jupyter notebook with the code for doing it
going through each page of the manuals and clicking expand
If you can archive the other stuff / whatever you can that would be great because I'm only going for the cheat codes
Very useful for emulators / games old and new
There are some games which are pretty much unplayable without cheat codes because they required certain hardware things
Think Pokémon trading to evolve, or Django the Solar Boy requiring the literal sun
JAA: Hm, I haven't found anything that didn't work for me without JavaScript yet.
Do you have an example?
second: http://gamehacking.org/game/4366
Click the down arrows on the side
JAA: Ah yeah, just saw that now.
second: It requires JavaScript and outputs the codes for each cheat device
Even includes notes
It's too bad ArchiveBot can't accept JavaScript to run on each page, or something like Selenium commands, but ArchiveBot doesn't even work like that from what I gather
It's more like a distributed wget
perhaps one day it can be upgraded to a very light and small browser, or even a proxy that an archiving browser uses to hit pages
Still a partial archive is better than no archive
JAA: is there an archive of allrecipes?
And are you adding gamehacking.org to the archive?
JAA: ArchiveBot does have PhantomJS, but that doesn't work too well and wouldn't help in this case at all.
Or to be precise, wpull supports PhantomJS, and ArchiveBot uses wpull internally.
second: wpull hasn't been updated in the longest time!
And isn't taking pull requests either
JAA: But that's just for scrolling and loading scripted stuff. It doesn't work for clicking on things etc.
Yes, I know. chfoo's been pretty busy, from what I gathered.
second: Is there a more updated version and does it work with youtube-dl now / still?
hmm they are actually in here
JAA: I know that youtube-dl is broken on at least most pipelines.
second: They could try giving permissions for others to merge code in or push to the project
JAA: No idea if it works when used directly with wpull.
There's the fork by FalconK, which has a few bug fixes, but other than that I'm not aware of anyone working on it.
I've been working on URL prioritisation for a while now, but I haven't spent much time on it really.
FalconK's also pretty busy currently, so yeah, nobody's even trying to maintain it.
second: URL prioritisation?
What is everyone busy with?
Is there a good way to save wikia websites?
So I have a lot of questions; it's not often I'm on EFnet (maybe I'll fix that), and I've been interested in archiving for a long time
JAA: https://gist.github.com/JustAnotherArchivist/b82f7848e3c14eaf7717b9bd3ff8321a
This is what I wrote a while ago about my plans.
It's semi-implemented, but there's still some stuff to do, in particular there is no plugin interface yet, which is necessary to then implement it into ArchiveBot (and grab-site).
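(Editorial aside: as a toy illustration of what URL prioritisation means in a wpull-like crawler — this is not the design from the gist, just the core idea of a priority-ordered fetch queue:)

```python
import heapq
from itertools import count

class UrlFrontier:
    """Toy URL frontier: lower priority value = fetched sooner."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: FIFO among equal priorities

    def put(self, url: str, priority: int = 0) -> None:
        heapq.heappush(self._heap, (priority, next(self._order), url))

    def get(self) -> str:
        return heapq.heappop(self._heap)[2]

frontier = UrlFrontier()
frontier.put("https://example.com/deep/page")
frontier.put("https://example.com/robots.txt", priority=-10)  # fetched first
```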
People are busy with real-life stuff, I guess.
Wikia's just Mediawiki, isn't it? There are two ways to save that, either through WikiTeam (no idea how active that is) or through ArchiveBot.
second: Can the archivebot archive a flakey site which requires login?
JAA: And regarding your earlier questions: there is no record of an archive of allrecipes in ArchiveBot; someone shared a dump in here a few months ago, but that's not a proper archive and can't be included in the Wayback Machine.
Yes, I added gamehacking.org to ArchiveBot.
second: Yeah, I found that one
JAA: No, login isn't supported by ArchiveBot.
Neither is CloudFlare DDoS protection and stuff like that, by the way.
second: dang, did not know about cloudflare
Why not cloudflare?
That is a lot of sites we can't archive then
JAA: Just the DDoS protection bit, i.e. the "Checking your browser" message thingy.
That requires you to solve a JS challenge...
There was some discussion on this in here a few days ago.
second: https://github.com/ArchiveTeam/ArchiveBot/issues/216
JAA: Yes, but cloudflare-scrape is a really shitty and insecure solution.
second: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-14,Thu&sel=124-150#l120
***: brayden has quit IRC (Read error: Connection reset by peer)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
cf has quit IRC (Ping timeout: 260 seconds)
cf has joined #archiveteam-bs
etudier has joined #archiveteam-bs
Stilett0- has joined #archiveteam-bs
Stilett0- is now known as Stiletto
chfoo: i haven't been feeling like maintaining wpull unfortunately :/ it became a big ball of code
***: kristian_ has joined #archiveteam-bs
dd0a13f37 has joined #archiveteam-bs
dd0a13f37: JAA: cloudflare whitelists tor using some strange voodoo magic (it's not just the user agent and it works without JS), can we utilize this somehow?
Or, well, it depends on the protection level, but for ~90% of sites you can browse via Tor. It didn't use to be this way, and if you do "copy as curl" from dev tools and paste into a terminal w/ torsocks you still get the warning page
JAA: dd0a13f37: Interesting. If we knew more about it, we could perhaps use it, yes. I wonder how reliable it is though.
dd0a13f37: It could be details in how SSL is handled
That seems like the only difference I can think of
JAA: That would be painful to replicate.
***: balrog has quit IRC (Ping timeout: 1208 seconds)
JAA: I guess implementing joepie91_'s code in a wpull plugin is probably easier.
dd0a13f37: Even if you do "new circuit for this site" and issue the request with a cookie that shouldn't be valid for that IP it still works
JAA: How do you get that cookie initially?
dd0a13f37: Can't you just add a hook to get a valid cookie without changing any structure?
The site sets it
JAA: Hm
dd0a13f37: You get a __cfduid cookie
when connecting to a cf site
JAA: So the normal procedure, right.
dd0a13f37: Are those tied to IPs?
JAA: Yeah, you could implement it as a hook, but the problem is that there is no proper implementation of a bypass.
dd0a13f37: Because if I copy the exact request and issue it with curl (same cookies, headers, ua) using torsocks it doesn't work
That's the spooky thing
What do you want to bypass? "one more step" or "please turn on js"?
JAA: "Checking your browser"
dd0a13f37: Isn't there?
JAA: Which is "please turn on JavaScript" if you have JS disabled.
Not as far as I know.
dd0a13f37: So what does joepie91's code do?
***: balrog has joined #archiveteam-bs
swebb sets mode: +o balrog
JAA: It parses the challenge and calculates the correct response without executing JavaScript.
dd0a13f37: Isn't that a bypass?
Or what exactly are you looking to do?
JAA: Yes, it is.
But it's written in JavaScript, not in Python.
https://gist.github.com/joepie91/c5949279cd52ce5cb646d7bd03c3ea36
dd0a13f37: Modify it so it prints the cookie to stdout, then just do shell exec
easy solution
JAA: Yeah, we'd like a pure-Python version so we can avoid installing NodeJS or equivalent.
I mean, it might work on ArchiveBot where we have PhantomJS anyway, but it'd also be nice to have it in the warrior, for example.
dd0a13f37: Can't you set it up as a web service? Send challenge page-get response
You only need to do it once
JAA: Huh, that's a nice idea actually.
A CF protection cracker API :-)
dd0a13f37: """protection"""
"""cracker"""
JAA: Hehe
dd0a13f37: And what about https://github.com/Anorov/cloudflare-scrape ?
JAA: That executes CF's code in NodeJS and is inherently insecure.
dd0a13f37: So it needs node?
JAA: You can easily trick it into executing arbitrary code, i.e. use it for RCE.
Yep
dd0a13f37: Oh ok
So how does the script work, does it take an entire page and return a cookie?
JAA: Which script?
dd0a13f37: https://gist.github.com/joepie91/c5949279cd52ce5cb646d7bd03c3ea36
JAA: I'm not sure. I've never used it, and I'm not familiar with using JavaScript like that (i.e. outside of a browser) at all.
dd0a13f37: Me neither
What is executed first? Or is it like a library, so you should look at the exports?
JAA: As far as I can tell, the function in index.js takes the challenge site as an HTML string as the argument and throws out the relevant parts of the JS challenge that you need to combine somehow to get the response.
The challenge looks like this, in case you're not familiar with it:
fVbMmUH={"twaBkDiNOR":+((!+[]+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]))};
fVbMmUH.twaBkDiNOR-=+((+!![]+[])+(!+[]+!![]+!![]+!![]+!![]+!![]+!![]+!![]+!![]));fVbMmUH.twaBkDiNOR*=+((!+[]+!![]+!![]+[])+...
So you need to transform each of those JSFuck-like expressions into a number and then -=, *=, etc. those numbers to get the correct response.
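(Editorial aside: a rough Python sketch of decoding one such numeric chunk, assuming the challenge only ever uses the simple digit-group pattern shown above — real challenges vary more than this. Each parenthesised group sums `!+[]`/`!![]` atoms, one per JS truthy coercion, and the groups concatenate as decimal digits:)

```python
import re

def decode_chunk(expr: str) -> int:
    """Decode a CF numeric chunk like +((!+[]+!![]+[])+(!+[]+!![])).

    Each innermost parenthesised group is a sum of !+[] / !![] / +[]
    atoms (true coerces to 1, +[] to 0); the groups concatenate
    digit-by-digit into one number.
    """
    digits = []
    for group in re.findall(r'\(([^()]*)\)', expr):
        # Count the truthy atoms; +[] contributes 0 and needs no count.
        digits.append(group.count('!![]') + group.count('!+[]'))
    return int(''.join(map(str, digits)))

# The solver would then apply the -=, *=, etc. operators to these
# numbers in sequence to compute the challenge response.
```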
dd0a13f37: Can't you just use a regex to sanitize it and then execute them unsafely?
JAA: Hahaha, good luck sanitising JSFuck.
I think cloudflare-scrape tries, but yeah...
dd0a13f37: Oh, it can execute code, not just return a value?
well then you're fucked
JAA: Yeah. The code would be huge, but you can write *any* JS script with just the six characters ()[]+! used in the challenge.
https://en.wikipedia.org/wiki/JSFuck
dd0a13f37: Was that an actual example or just randomly generated?
JAA: That's an actual example.
dd0a13f37: Where can I find one?
A complete one
JAA: https://gist.github.com/anonymous/85c9b2b57726135a2500a8425b370095
dd0a13f37: I don't understand the purpose
Anyone who wants to do evil stuff would just use one of those scripts, and they're using a botnet so they wouldn't care about cloudflare infecting them
What's the point?
JAA: Idk either
***: etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
dd0a13f37: I don't get it, why can't you just use proxies for the really unfriendly sites?
***: Asparagir has joined #archiveteam-bs
JAA: And by the way, it's not just about CloudFlare serving evil code. Anyone could easily trigger cloudflare-scrape from their own server with an appropriate response.
***: svchfoo3 sets mode: +o Asparagir
svchfoo1 sets mode: +o Asparagir
dd0a13f37: Well, I doubt you care about ACE when running a botnet
JAA: Specifically: https://github.com/Anorov/cloudflare-scrape/blob/ee17a7a145990d6975de0be8d8bf5b0abbd87162/cfscrape/__init__.py#L41-L47
Yeah, I just mean in general.
dd0a13f37: There are commercial proxy providers with clean IPs, the cost of renting a bunch would probably be cheaper than what you spend on hard drives
Got another response from itorrents, he said he would upload database to archive.org and send link, the other three still haven't responded
JAA: Looking at generated jsfuck code, it's usually very long
CF is quite short
so you should be able to use a regex and limit the length
for example, encoding the character 'a' is 846 chars
http://www.jsfuck.com/
And CF's brackets are always empty - [], jsfuck needs to have something inside to eval
JAA: Yeah, I'm aware of that. It's still sloppy though.
dd0a13f37: It should be safe though
JAA: I don't think you strictly need something inside the brackets to do things in JSFuck, but it probably helps shorten the obfuscated code.
dd0a13f37: You can never get the eval() you need to do bad things
It shouldn't be Turing complete
JAA: Possible
I don't really know enough about JSFuck to say for sure.
***: arkhive has joined #archiveteam-bs
dd0a13f37: https://esolangs.org/wiki/JSFuck
it needs a big blob which is not possible to encode in under a certain number of characters; it's ugly as fuck but it should be safe
the eval blob is 831 characters, so if you set an upper limit at 200 you should be fine
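(Editorial aside: the length-plus-charset filter suggested here could look like the minimal sketch below. Whether a 200-char cutoff is genuinely sufficient is exactly the open question in this discussion, so treat it as an illustration, not a vetted sanitiser:)

```python
import re

# Accept only expressions built purely from the six JSFuck characters
# ()[]+! and short enough that the ~831-char eval bootstrap blob
# cannot fit. Anything else is rejected outright.
SAFE_CHUNK = re.compile(r'[()\[\]+!]+')

def is_plausibly_safe(expr: str, limit: int = 200) -> bool:
    """True if expr is short and contains only the six allowed characters."""
    return len(expr) <= limit and SAFE_CHUNK.fullmatch(expr) is not None
```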
***: etudier has joined #archiveteam-bs
etudier has quit IRC (Client Quit)
dd0a13f37 has quit IRC (Ping timeout: 268 seconds)
mundus: What's the best tool for large site archival?
***: arkhive has quit IRC (Quit: My iMac has gone to sleep. ZZZzzz…)
JAA: mundus: Define "large"?
mundus: like a million pages
JAA: wpull can handle that easily, assuming you have sufficient disk space.
mundus: Okay
I was guessing wpull
JAA: Not sure if it's the "best" tool, but it works well.
I've run multi-million URL archivals with wpull several times.
mundus: alright, what options do you normally use?
JAA: I think I mostly copied those used in ArchiveBot, then adapted them a bit in some cases.
https://github.com/ArchiveTeam/ArchiveBot/blob/a6e6da8ba37e733e4b10b7090b5fc4a6cffc9119/pipeline/archivebot/seesaw/wpull.py#L18-L53
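(Editorial aside: a sketch of what such an invocation could look like, loosely modelled on the ArchiveBot arguments linked above — the flag selection here is illustrative, not ArchiveBot's exact set:)

```python
import subprocess

def wpull_command(url: str, name: str) -> list:
    """Build an illustrative wpull command line for a large crawl."""
    return [
        "wpull", url,
        "--recursive", "--level", "inf",
        "--page-requisites",
        "--warc-file", name,          # write a WARC instead of plain files
        "--database", f"{name}.db",   # resumable crawl state in SQLite
        "--no-robots",
        "--tries", "3",
        "--delete-after",             # keep the WARC, not a mirror tree
    ]

# subprocess.run(wpull_command("https://example.com/", "example-com"))
```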
mundus: cool, thanks
joepie91_: mundus: you may find grab-site useful also
sort of like a local archivebot
mundus: ref https://github.com/ludios/grab-site
mundus: oh nice
second: chfoo: do you have a doc explaining how wpull works with youtube-dl etc or how it should work?
How do I become a member of the ArchiveTeam and what would that mean?
JAA: is there a doc somewhere with how the IA archives things and keeps backups?
***: etudier has joined #archiveteam-bs
BartoCH has quit IRC (Ping timeout: 260 seconds)
JAA: second: You become a member by doing stuff that aligns with AT's activities. There isn't anything formal.
There is some stuff in the "help" section of archive.org, and also some blog entries. Not sure what else exists.
I don't think the individual archival strategies etc. are documented well (publicly) though.
***: BartoCH has joined #archiveteam-bs
jrwr: second: anyone can do /something/, we are more of a method than anything. what do you want to do?
***: kristian_ has quit IRC (Remote host closed the connection)
second: not sure, I'm more working on file categorization / curation right now
What kind of things shouldn't we archive?
jrwr: Well
That's a hard question
If you are doing web archival, I would make sure to save everything as WARCs
(wget supports this, so does wpull)
Anything else, just do best quality you can. the more metadata the better
make an account on IA and go to town uploading things
check out SketchCow's IA and see how he uploads things
(for things like CDs, Tapes, Paper)
***: DFJustin has quit IRC (Remote host closed the connection)
DFJustin has joined #archiveteam-bs
swebb sets mode: +o DFJustin
ZexaronS has quit IRC (Quit: Leaving)
drumstick has joined #archiveteam-bs
Honno has quit IRC (Read error: Operation timed out)
Soni has quit IRC (Ping timeout: 506 seconds)
Soni has joined #archiveteam-bs
second: Does the internet archive have deduplication active?
I wouldn't want to upload a bunch of stuff and waste their space
***: ZexaronS has joined #archiveteam-bs
second: JAA: has this been archived? https://www.reddit.com/r/opendirectories/comments/6zuk7v/alexandria_library_38029_ebooks_from_5268_author/
https://alexandria-library.space/Ebooks/Author/
https://alexandria-library.space/Ebooks/ComputerScience/
https://alexandria-library.space/Images/ww2/north-american-aviation-world-war-2/
https://alexandria-library.space/Images/
JAA: Not yet, as far as I know, but arkiver just added them to ArchiveBot.
arkiver: yeah
***: BartoCH has quit IRC (Quit: WeeChat 1.9)
second: Did you do it because I said something or was it already added? I'm wondering if you guys watch that and other reddit(s)
Is there an archive of scihub?
JAA: I watch some subreddits, but not opendirectories (yet).
arkiver: added because you said it
it looks like something we want to archive
JAA: We were discussing libgen several times in the past few days. See the logs: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-17,Sun
Basically, at this point, I assume that IA has a darked copy of it, and even if they don't, the dataset won't disappear anytime soon and can still be archived *if* libgen actually gets in trouble.
second: Isn't libgen always possibly in trouble?
Different governments / institutions trying to shut it down
JAA are you Jason Scott?
JAA: Possible, but I wouldn't be worried about the data until libgen actually goes offline or similar.
The data is available in (active) torrents and on Usenet...
No, that's SketchCow.
second: How does one setup a Usenet account / get one, is there a guide somewhere?
JAA: First rule of Usenet...
second: Dammit
JAA: :-P
Check out /r/usenet. They have a ton of good information.
second: Will you guys archive porn?
JAA: Well, we did archive Eroshare, so there's that.
***: Soni has quit IRC (Read error: Connection reset by peer)
JAA: There's also that 2 PB webcam archive by /u/Beaston02.
second: Eh, I found a wiki which list actors in porn but you need to login
JAA: That's not on IA though.
second: Can you archive it?
Why not?
All this stuff on the IA and the most viewed stuff in the art museum is vintage porn
http://95.31.3.127/pbc/Main_Page
JAA: Well, I don't think IA is interested in spending 3-4 million dollars over the next few years for random porn webcams.
(That number is based on https://twitter.com/textfiles/status/885527796583284741 )
second: How do people archive 2PB of data?!
JAA: I'm not saying it shouldn't be archived. In general, my opinion is that everything should be kept. Unfortunately though, that's not very realistic, and I think there are more important things to preserve than random porn webcams.
Amazon Cloud Drive and now Google Drive.
second: Wait a minute, Jason Scott is the same guy behind textfiles.com, interesting
JAA: Some people suspect that ACD only killed the unlimited offer because of Beaston02 storing those webcam recordings there.
second: JAA: are there any upcoming storage breakthroughs that you can think of?
Lol, "this is why we can't have nice things"
***: ld1 has quit IRC (Read error: Connection reset by peer)
JAA: No idea really. HAMR will come, but that probably won't really reduce storage costs massively, i.e. not a real breakthrough. DNA storage is still far away, I guess. Otherwise, I don't really know too much about other technologies currently in development.
***: ld1 has joined #archiveteam-bs
etudier has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
etudier has joined #archiveteam-bs
jrwr: I think DNA might be a good ROM
not WMRM (write many, read many)
or like old school tape drives
JAA: Yeah, it sounds pretty perfect for long-term archival.