Massive UFO disclosure in USA: A challenge for ATS

Originally posted by idealord
Writing parsers for inconsistent data is impossible, or at best problematic. Plus, when you add meaningless characters in the middle of the data -
those >> characters aren't ASCII, they're who knows what - it's just a mess. I was lucky to be able to separate city and state as often as I
did.

I'm afraid at some point it'll come down to volunteers hand-fixing the data (which is why I kept the processed raw on the side).

Thanks Jeff.

I think I understand (most of your comments anyway...). Of course, I'd like the automated process to deal with as much of the tidying up as
possible, but I accept there will be limits. However, while you would probably be horrified at my approach, I think I could use Microsoft Word (which
I'm more at home with than any script or code...) to automate some more of the tidying up if I import your spreadsheet as a table in Word and then
record a couple of basic macros. This would not solve all the issues, but I should be able to use the raw information in your final column to fill in
quite a few of the gaps in the other columns of the spreadsheet you are kindly creating.

I have an idea as to how UFOs could be refueling in our atmosphere. A ring of anti-matter was discovered around the Earth this week by the
PAMELA science instrument on board a Russian satellite in a near-polar orbit. (Polar orbits are typically used for stealth satellites - yes, those
exist.)

Please check out my blog for my idea and the reference to the article, which in turn contains a reference to the technical paper. novelfindings.blogspot.com...

Not horrified at all! In fact, that's why I'm saving Raw. As I explained earlier, I have a second database that I haven't put up yet which just
has the IDs and the unprocessed raw data, thinking I might come up with a better approach at a later date.

I'm up to 84801 records right now, FWIW... I'm starting to think before I get back to the Canadian data, I might get back to downloading the
low-res docs. It's probably a good thing to preserve and with this data, I could make a huge PDF that we could slowly get better at organizing with
a little tweaking.

I was thinking it'd be nice to have all the incidents page by page, browsable, so everything from the Long Island incident of 1963, for example, would be next to
each other.

About the Canadian stuff, is it true the database only lists 2000 pages? If so, my spidering approach is very limited. How did you guys confirm
that?

Google wouldn't let me have a spreadsheet over 50,000 cells, EditGrid wouldn't let me upload a spreadsheet over 2MB, so here's an XLS file from my
site zipped.

I fixed ALL the parsing errors by taking the raw data and replacing the bad binary characters with | symbols and then loading it into a spreadsheet
program. The file also contains the raw data. I've lost the City/State separation, but because the original data was already non-normalized (read
loooooose) we have complete consistency now. It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as
pretty much anything now...
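
A rough Python sketch of that clean-up idea might look like the following; the file names and the choice of which bytes count as "bad" are illustrative guesses rather than the actual script used:

```python
# Sketch only: replace non-ASCII "binary" bytes in the raw dump with pipe
# characters so a spreadsheet can treat them as column delimiters.
# File names and the byte ranges treated as "bad" are assumptions.
def clean_raw(in_path="bluebook_raw.txt", out_path="bluebook_clean.txt"):
    with open(in_path, "rb") as f:
        data = f.read()
    cleaned = bytearray()
    for b in data:
        # Keep newlines and printable ASCII; everything else becomes a pipe.
        if b in (0x0A, 0x0D) or 0x20 <= b <= 0x7E:
            cleaned.append(b)
        else:
            cleaned.append(ord("|"))
    text = cleaned.decode("ascii")
    # Collapse runs of pipes so adjacent bad bytes give a single delimiter.
    while "||" in text:
        text = text.replace("||", "|")
    with open(out_path, "w") as f:
        f.write(text)

if __name__ == "__main__":
    clean_raw()
```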

Don't be so quick to declare a complete victory, although I think you are pretty darn close in terms of extracting information...

Apart from the small matter of also downloading all the images - okay, not so small (although hopefully Xtraeme is fairly close to sorting this
problem out), I have a few relatively minor issues with your draft spreadsheet.

In terms of Xtraeme's approach to downloading the images themselves and the fact that he
very recently mentioned he was working on extracting the metadata
associated with each image, I presume that your work basically solves that extraction issue. I'm not sure, though, whether your explanation of your approach is
sufficient to save Xtraeme the effort of duplicating the extraction process. (It's certainly not enough for me to duplicate your work
but, hey, you'd have to write an explanation that was the length of at least one book - if not more - for me to be able to understand the issues
sufficiently to follow a more detailed technical explanation). I wonder, however, if you could send Xtraeme a U2U to check if he is still working on
the extraction problem or post in this thread any more technical details that will save him any time and effort (if those details would not be
sufficiently clear to Xtraeme from your summary so far).

I'd simply like to ensure Xtraeme doesn't waste any time reinventing the wheel when completing his work...

Originally posted by idealord
Google wouldn't let me have a spreadsheet over 50,000 cells, EditGrid wouldn't let me upload a spreadsheet over 2MB, so here's an XLS file from my
site zipped.

I fixed ALL the parsing errors by taking the raw data and replacing the bad binary characters with | symbols and then loading it into a spreadsheet
program. The file also contains the raw data. I've lost the City/State separation, but because the original data was already non-normalized (read
loooooose) we have complete consistency now. It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as
pretty much anything now...

First of all - WOW! What an improvement over the first draft spreadsheet (and - in some respects, e.g. an indication of the number of pages held in
relation to each incident - over any index to the Project Bluebook files I've ever seen before!).

I only have access to Excel 2002 tonight and can only see just over 65,000 rows. I'll run your spreadsheet on Excel 2007 on another machine
tomorrow. I presume that on Excel 2007 I'll see one row for each image (i.e. about 129,000 images/rows).

I'm not sure how many people have seen the potential of your spreadsheet yet - particularly when combined with one or two other UFO databases and
tools. I may have to create a thread or two using your spreadsheet in combination with those other databases just to show what can now be done. That
raises the issue of how to credit your work when I refer to your spreadsheet here and elsewhere - would you prefer I give your ATS username or your
real name (in which case perhaps you could let me know your surname here or in a U2U)?

A couple of (relatively minor) issues, which I can probably resolve myself:

(1) At the moment, some of the images appear to have multiple rows in the spreadsheet (although I don't know yet whether this is simply an effect of
my using Excel 2002 tonight). If I sort by Column A, I can see the same image number (with the same content in the remaining columns) commonly
appears 3 or 4 times. In terms of the statistical problems I'd like to try to address using the spreadsheet, I'd need to eliminate that duplication
(assuming that this issue remains when I use Excel 2007 tomorrow).

(2) Column E (which states the page number within the relevant file on an incident) does not appear to sort properly (again, using Excel 2002
tonight). For example, between "Page 1" and "Page 2" appear "Page 11", "Page 12", etc.

Dear IsaacKoi,
Thanks very much for drawing my attention to this Footnote resource. I had no idea it existed (embarrassingly) and will enjoy going through it. There's
so much detail, time and effort put into all these documents by government officials, which says something in itself.

The best thing about so much information being released is that there is bound to be information in there which they didn't mean to release, via human error.

It's a shame some sort of wiki project, with timelines and direct links for each document, can't be set up to make it all much more coherent and
accessible.

Just want to give a quick update. I've got some bugs that I'm still trying to work out of the system, but progress is good. I also spent a couple
hours yesterday trying to figure out how they were generating the hash, but so far haven't figured it out. So I went back to the macro approach.

Hopefully by tomorrow or the next day I'll have it finalized (I typically let it run during the evening to find out when and if the program breaks).
The app's got a simple interface - just a basic Windows application with a few menu buttons to start, stop, pause, save, load, and keep track of where
it is in the extraction. All the information about what it's doing is piped to a file that's then 'tail -f'ed to a console.
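
That "write everything to a log file and watch it with tail -f" pattern is easy to reproduce; below is a minimal sketch in Python (the log file name is assumed, and this is not Xtraeme's actual code):

```python
# Minimal sketch of logging progress to a file that can be followed with
# "tail -f" from another console. The log file name is an assumption.
import logging

logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("extraction started")
logging.info("saved image %s", "12345")
# In a separate terminal:  tail -f scraper.log
```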

I've also got it to a state where I can fish out the JavaScript metadata on the fly, but as a backup I'm going to use Jeff's database to double-check
that the data is correct. If the JavaScript call fails I'll use the CSV as the final determiner of what the name will be. And that's about it at
the moment!

Hey Xtraeme, FWIW, my meta-data is not what you get on the side-panel of the fancy Flash Footnote viewer. It's just what appears on the search output
window. You've got the annotations and a lot of other cool data accessible there. I looked at the hash thing too. Did you decompile the Flash? I
can try that if you haven't. Is there a hashing function in the JS? I haven't downloaded all of it yet. They don't seem to be intentionally hiding
stuff; the hashing seems to be there to make you join Footnote before you can download.

As far as redundant data goes, that's probably my fault. I had a wonky wifi connection yesterday and had to restart my spider about 5 times
when I lost connectivity. It writes to the file constantly. I'm sure I didn't miss anything, but I did suspect that it would produce a few extra
rows. The area around the 120000+ range was particularly painful! As far as the sorting stuff, that's a typical sorting error because it's an
alphabetical sort. Make sure you're sorting numerically - of course you'll have to get rid of the Page word. I can get rid of that easily with a
search and replace and it should sort numerically.
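
For anyone doing this outside Excel, a minimal sketch of the same fix in Python - strip the word "Page" and sort on the number - might look like this (the column position and file name are assumptions about the spreadsheet layout):

```python
# Sketch: sort rows by the numeric page value instead of alphabetically.
# PAGE_COL and the file name are assumptions, not the real layout.
import csv

PAGE_COL = 4  # assumed position of the "Page N" field (column E)

with open("bluebook_clean.txt", newline="") as f:
    rows = list(csv.reader(f, delimiter="|"))

def page_number(row):
    label = row[PAGE_COL].strip() if len(row) > PAGE_COL else ""
    try:
        return int(label.replace("Page", "").strip())  # "Page 11" -> 11
    except ValueError:
        return 10**9  # rows without a usable page label sort to the end

rows.sort(key=page_number)  # rows can then be inspected or written back out
```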

As far as the dupes, I can run it again - heh - and produce new redundancies... sometime.

Do you guys want the 1024x1024 images?

Isaac, you can credit the database to my real name (which many people know already) - Jeff Harrington.

Originally posted by Xtraeme
Just want to give a quick update. I've got some bugs that I'm still trying to work out of the system, but progress is good.

Hi Xtraeme,

Thanks for the update. It sounds like things are going well.

As I've said before, let me know if there is anything you need help with. (As I've also said before, my own technical abilities are limited so my
main support will be creating another thread and inviting people to help you with anything that is proving time-consuming - as I did in the recent
Canadian UFO thread).

Originally posted by idealord
As far as the sorting stuff, that's a typical sorting error because it's an alphabetical sort. Make sure you're sorting numerically - of course
you'll have to get rid of the Page word. I can get rid of that easily with a search and replace and it should sort numerically.

Hi Jeff,

No problem, even I can use the find and replace command in Excel (or Word...) to find "Page " and replace with nothing.

Originally posted by idealord
As far as the dupes, I can run it again - heh - and produce new redundancies... sometime.

I wouldn't expect you to run it again. I'm sure someone will be able to tell me a way to eliminate duplicate lines in Excel. (I'll post a brief query
on a relevant part of ATS - I think there is a computer or technical sub-forum...).
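
(For what it's worth, Excel 2007 has a Remove Duplicates button on the Data tab that should do this directly. Outside Excel, a short Python sketch along these lines would also work; the file names and the assumption that the image ID is the first pipe-delimited field are illustrative only:)

```python
# Sketch: keep only the first occurrence of each image ID. Assumes the ID is
# the first pipe-delimited field; file names are placeholders.
seen = set()
with open("bluebook_clean.txt") as src, open("bluebook_dedup.txt", "w") as dst:
    for line in src:
        image_id = line.split("|", 1)[0]
        if image_id not in seen:
            seen.add(image_id)
            dst.write(line)
```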

Originally posted by idealord
Do you guys want the 1024x1024 images?

Mmm. Let's wait a bit and see whether Xtraeme gets his method working to download the higher definition images.

Originally posted by idealord
Isaac, you can credit the database to my real name (which many people know already) - Jeff Harrington.

In relation to the question that you posted on my profile as to whether this would be a suitable document with which to start your own thread, I think
that would be fine - particularly if you include your reasons for finding it interesting.

I'd suggest resizing the image of this document that you included in your previous post, since it did not fit on the screen without scrolling
across.

Originally posted by idealord
It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...

Hi Jeff,

I'm looking at your spreadsheet again now on a machine running Excel 2007 - but I still only see about 65,000 rows (65,536 to be precise, which -
from memory - is the same number I saw last night on a machine running Excel 2002).

How many rows should I be seeing?

(I'm assuming there should be one row for each page in the Project Bluebook collection on Footnote.com, i.e. just over 129,000).

I just loaded it into Excel Starter and it shows 124,741 rows. So it looks like Excel 2010 can handle it. In that zip file there's the raw data; you
should be able to load that into Excel and tell it to delimit on the pipe - | - character. (Just that upward straight line.) It's basically a CSV
file, but you really can't comma-delimit spreadsheets when you have commas in the data, obviously...
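
If Excel keeps truncating the data at 65,536 rows, one workaround is to load the pipe-delimited raw file in Python and write it back out as .xlsx, which holds roughly a million rows. A sketch, assuming the pandas and openpyxl packages are installed and using placeholder file names:

```python
# Sketch: load the pipe-delimited dump and re-save it as .xlsx so older
# Excel row limits don't apply. File names are placeholders; needs a
# reasonably recent pandas (for on_bad_lines) plus openpyxl.
import pandas as pd

df = pd.read_csv("bluebook_raw_pipe.txt", sep="|", header=None,
                 dtype=str, on_bad_lines="skip")  # skip malformed lines
print(len(df), "rows loaded")  # should be well past 65,536
df.to_excel("bluebook_full.xlsx", index=False)
```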

Let me know if that doesn't work for you. I can split it into pieces if you want, also...

Originally posted by idealord
Problem is you won't be able to globally sort them if you have 2 different spreadsheets and that'll limit your understanding of the pagination...
what are you going to combine them into?

Good point.

I'll get hold of something that can handle the bigger spreadsheet and will then download your zip file again. I'll get back to you on this soon.

Originally posted by idealord
It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...

Hi Jeff,

I've been looking into ways to open the file in some version of Excel.

I've opened the txt file (in Notepad) and seen the mass of data, and I've opened it in Excel using the pipe character to delimit the columns. The good news
is that I can now see the larger number of rows of data. The bad news is that this method seems to result in one column for the record number and just
one column for all the remaining fields. In other words, the month, year, location and page number are not in separate columns. Looking at the txt
file (sample below), I can't see the pipe used as a delimiting character except after the record number:

Alternatively, Excel 2007 can handle the larger number of rows - as long as the file is in the right format. I've watched the video below on
converting files from the old xls format (used in your zip file) to the new Excel 2007 xlsx format, but the data remains at about 65,000 rows. I
wonder if others have the same problem or whether you could save the spreadsheet in the xlsx format instead.

I could download the software you mention, but I'm struggling to get to grips with Excel so I'd rather avoid trying to pick up another piece of
software...

Originally posted by IsaacKoi
I've watched the video below on converting files from the old xls format (used in your zip file) to the new Excel 2007 xlsx format, but the data
remains at about 65,000 rows.

If the file is an "xls" then there's no way he can have more than 65,536 rows.

I think the best way would be to save the file as "Tab delimited text", that format is used by all spreadsheet and database programs (and it's very
easy to use in scripts or home-made programs).
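
If the data is already pipe-delimited, converting it to the tab-delimited text suggested above is nearly a one-liner; a sketch with placeholder file names:

```python
# Sketch: convert the pipe-delimited dump to tab-delimited text.
# Assumes the fields themselves contain no tabs; file names are placeholders.
with open("bluebook_raw_pipe.txt") as src, open("bluebook_tab.txt", "w") as dst:
    for line in src:
        dst.write(line.replace("|", "\t"))
```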

For the purposes of the scraper I only need to know enough information to figure out what to name the directory, so the script drops all the files
that don't indicate "page 1". Scanning to the bottom of the newly generated CSV would seem to indicate that there are a total of 10,464 documents.
However, I've noticed a few of the reports, like the Tom's River, New Jersey sighting, aren't in the list. I don't want to duplicate your work, so would
you be opposed to sharing the script you used to harvest the data? I just want to make sure we're not missing any other files, especially since I'll
be using this to ultimately determine the directory names.
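
The "only keep rows marked page 1" counting step described above could be reproduced with something like the sketch below; the file name and the guess that the page label sits somewhere in each pipe-delimited row are assumptions, not Xtraeme's actual script:

```python
# Sketch: count documents by keeping only rows whose page label is "Page 1".
# File name and column layout are assumptions.
import csv

documents = 0
with open("bluebook_clean.txt", newline="") as f:
    for row in csv.reader(f, delimiter="|"):
        if any(field.strip().lower() == "page 1" for field in row):
            documents += 1

print(documents, "documents found")  # the thread reports 10,464
```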
