Category Archives: Big Data


The GitHub repository of the #NGSchool website has grown to over 5GB. I wanted to reduce the size & simplify this repository, but this task turned out to be quite complicated. Instead, I have decided to leave the current repo as is (and probably remove it soon) and start a new repo from the existing version. I could do that, as I don’t care about versions earlier than the one I’m currently using. This is a short how-to:

Push all changes and remove .git folder

git push origin master
rm -rI .git

Rename existing repo

Settings > Repository name > RENAME

Start new repository using old repo name

No need to create any files, as everything already exists locally.

Init your local repo and add new remote

git init
git remote add origin git@github.com:USER/REPO

Commit changes and push

git add --all . && git commit -m "fresh" && git push origin master

After doing so, my new repo is below 1GB, which is much better than the previous 5GB.

Lately, I have had lots of problems pushing large files to GitHub. I maintain a compilation of materials and software deposited by other people, so I cannot control the size of the files… and this makes pushes fail often.

I have spent quite some time today trying to add a batch of user accounts to Drupal 8. After all, it’s not that difficult: the whole process boils down to installing the PHP module (which was stripped from Drupal 8) and creating a new page. Make sure this page is available only to the system administrator!

The code below defines an HTML form and the PHP that reads the input and stores the user information. Entries are skipped if the given user name is already registered. Make sure this page is saved, but UNPUBLISHED! Otherwise, other users will be able to use it and register user accounts!!!
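A rough sketch of the user-creation part, using Drupal 8’s entity API (form handling is omitted; variable names and the placeholder password are illustrative, not the exact code from the page):

<?php
use Drupal\user\Entity\User;

// $lines: submitted form input, one "name:email" entry per line (form code omitted).
foreach ($lines as $line) {
  list($name, $mail) = explode(':', trim($line));
  // Skip the entry if the given user name is already registered.
  if (user_load_by_name($name)) {
    continue;
  }
  $account = User::create(array(
    'name' => $name,
    'mail' => $mail,
    'status' => 1,          // create the account as active
  ));
  $account->setPassword('CHANGE-ME');  // illustrative; generate per-user passwords
  $account->save();
}
?>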

Git is great, there is no doubt about that. Being able to revert any changes and recover lost data is simply priceless. But recently, I have started to be concerned about the size of some of my repositories. Some, especially those containing changing binary files, were really large!!!
You can check the size of your repository with a simple command:

git count-objects -vH

Here, Git Large File Storage (LFS) comes into action. Below, I’ll describe how to install it and how to mark large binary files, so that their contents are kept out of the regular Git history and only lightweight pointers are committed.
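For reference, the basic workflow looks like this (the `*.bam` pattern and file name are just examples):

git lfs install                 # one-time setup, after installing the git-lfs binary
git lfs track "*.bam"           # mark a pattern to be handled by LFS
git add .gitattributes          # the tracking rules live in .gitattributes
git add data.bam                # matching files are now committed as LFS pointers
git commit -m "add large binary via LFS"
git push origin master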

Working with millions of intermediate files can be very challenging, especially if you need to store them on a distributed/network file system (e.g. NFS). This makes listing and navigating the directories take ages… and removing these files very time-consuming.
While building the metaPhOrs DB, I needed to store some ~7.5 million intermediate files that were subsequently processed on an HPC cluster. Keeping this many files on NFS would seriously affect not only my own work, but also overall system performance.
One could store the files in an archive, but then, to retrieve the data, you would need to parse rather huge archives (tens to hundreds of GB) in order to retrieve rather small portions of data.
I realised that TAR archives are natively supported in Python and can be indexed (see `tar_indexer`), which provides easy integration into existing code and random access. If you work with text data, you can even zlib.compress the data stored inside your archives!
Below is a minimal sketch of the approach (the complete code is in `tar_indexer`); the archive name and helper functions are illustrative:
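import io
import tarfile
import zlib

ARCHIVE = "batch_0001.tar"  # illustrative name; one plain (uncompressed) tar per batch

def store(entries):
    """Append zlib-compressed text entries; 'a' mode requires an uncompressed tar."""
    with tarfile.open(ARCHIVE, "a") as tar:
        for name, text in entries:
            data = zlib.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

def build_index():
    """Single sequential scan; remember where each member's data block starts."""
    with tarfile.open(ARCHIVE) as tar:
        return dict((m.name, (m.offset_data, m.size)) for m in tar.getmembers())

def fetch(name, index):
    """Random access: seek straight to the recorded offset, no tar scanning."""
    offset, size = index[name]
    with open(ARCHIVE, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(size)).decode("utf-8")

A plain tar is used on purpose: append mode and seeking don’t work on gzipped archives, and compressing each entry separately with zlib keeps random access while still saving space.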

Recently, one of my long-running multiprocessing scripts died with a rather uninformative traceback:

Traceback (most recent call last):
  File "src/homologies2mysql_multi.py", line 294, in <module>
    main()
  File "src/homologies2mysql_multi.py", line 289, in main
    o.noupload, o.verbose)
  File "src/homologies2mysql_multi.py", line 242, in homologies2mysql
    for i, data in enumerate(p.imap_unordered(worker, pairs), 1):
  File "/usr/lib64/python2.6/multiprocessing/pool.py", line 520, in next
    raise value
ValueError: need more than 1 value to unpack

I could run it without multiprocessing, but then I’d have to wait some days for the program to reach the point where it crashes.
Luckily, Python is equipped with the traceback module, which allows handy tracing of exceptions.
You can add a decorator to the problematic function that will report a nice error message:
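A minimal version of such a decorator (the decorator name is mine, not from the traceback module):

import traceback
from functools import wraps

def trace_exceptions(func):
    """Print the full traceback from inside a pool worker, then re-raise."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            print("Exception in %s:" % func.__name__)
            print(traceback.format_exc())
            raise
    return wrapper

@trace_exceptions
def worker(pair):
    ...

With `worker` decorated like this, the real traceback is printed from inside the worker process before multiprocessing re-raises the bare exception in the parent.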

If you are (like me) annoyed by providing a password at every MySQL login, you can skip it. This also makes programmatic access to any MySQL DB easier, as no password prompting is necessary 🙂
Create a `~/.my.cnf` file:
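Mine looks roughly like this (placeholders instead of real credentials); since the password is stored in plain text, make the file readable only to you:

[client]
user=USERNAME
password=PASSWORD

chmod 600 ~/.my.cnf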

I had to retrieve data from multiple .xlsx files with multiple sheets. This can be done manually, but it would be a rather time-consuming task; plus, Office quotes text fields, which is not very convenient for downstream analysis…
I found a handy script, xlsx2tsv.py, that does the job, but it reports only one sheet at a time. Thus, I have rewritten xlsx2tsv.py a little to save all sheets from a given .xlsx file into a separate folder. In addition, multiple .xlsx files can be processed at once. My version can be found on github.