Academic Commons Search Results (en-us)
http://academiccommons.columbia.edu/catalog.rss?f%5Bdepartment_facet%5D%5B%5D=Computer+Science&f%5Borganization_facet%5D%5B%5D=Columbia+University&q=&rows=500&sort=record_creation_date+desc

Multi-Persona Mobile Computinghttp://academiccommons.columbia.edu/catalog/ac:182958
Andrus, Jeremy Christianhttp://dx.doi.org/10.7916/D8TB15QSThu, 05 Feb 2015 00:00:00 +0000Smartphones and tablets are increasingly ubiquitous, and many users rely on multiple mobile devices to accommodate work, personal, and geographic mobility needs. Pervasive access to always-on mobile computing has created new security and privacy concerns for mobile devices that often force users to carry multiple devices to meet those needs. The volume and popularity of mobile devices have commingled hardware and software design, and created tightly vertically integrated platforms that lock users into a single, vendor-controlled ecosystem. My thesis is that lightweight mechanisms can be added to commodity operating systems to enable multiple virtual phones or tablets to run at the same time on a physical smartphone or tablet device, and to enable apps from multiple mobile platforms, such as iOS and Android, to run together on the same physical device, all while maintaining the low latency and responsiveness expected of modern mobile devices. This dissertation presents two lightweight operating system mechanisms, virtualization and binary compatibility, that enable multi-persona mobile computing. First, we present Cells, a mobile virtualization architecture enabling multiple virtual phones, or personas, to run simultaneously on the same physical cellphone in a secure and isolated manner. Cells introduces device namespaces that allow apps to run in a virtualized environment while still leveraging native devices such as GPUs to provide accelerated graphics. Second, we present Cycada, an operating system compatibility architecture that runs applications built for different mobile ecosystems, iOS and Android, together on a single Android device. Cycada introduces kernel-level code adaptation and diplomats to simplify binary compatibility support by reusing existing operating system code and unmodified frameworks and libraries. Both Cells and Cycada have been implemented in Android, and can run multiple Android virtual phones, and a mix of iOS and Android apps on the same device with good performance. Because mobile computing has become increasingly important, we also present a new way to teach operating systems in a mobile-centric way that incorporates the concepts of geographic mobility, sensor data acquisition, and resource-constrained design considerations.Computer science, Computer engineeringjca2119Computer ScienceDissertationsDisCo: Displays that Communicatehttp://academiccommons.columbia.edu/catalog/ac:181983
Jo, Kensei; Gupta, Mohit; Nayar, Shree K.http://dx.doi.org/10.7916/D8RV0MG6Fri, 30 Jan 2015 00:00:00 +0000We present DisCo, a novel display-camera communication system that enables displays to send short messages to digital sensors, while simultaneously displaying images for human consumption. Existing display-camera communication methods are largely based on spatial-domain steganography, where the information is encoded as an imperceptible spatial signal (e.g., QR-code). These methods, while simple to implement, are prone to errors due to common causes of image degradation such as occlusions, the display being outside the sensor’s field of view, defocus blur, and perspective distortion. Due to these limitations, steganography-based techniques have not been widely adopted, especially in uncontrolled settings involving consumer cameras and public displays.Computer sciencekj2321, mg3156, skn3Computer ScienceTechnical reportsMaking Lock-free Data Structures Verifiable with Artificial Transactionshttp://academiccommons.columbia.edu/catalog/ac:181977
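The temporal (rather than spatial) encoding that DisCo relies on can be sketched in a few lines. The abstract does not give the actual modulation scheme, so the complementary-frame-pair scheme, function names, and constants below are illustrative assumptions only:

```python
import numpy as np

def embed_frames(image, bits, delta=2.0):
    """Encode each bit as a small +/- brightness offset on successive frames.

    Paired complementary frames keep the time-averaged image unchanged,
    so human viewers perceive only `image` while a sensor sampling
    individual frames can recover the message."""
    frames = []
    for b in bits:
        sign = 1.0 if b else -1.0
        frames.append(np.clip(image + sign * delta, 0, 255))
        frames.append(np.clip(image - sign * delta, 0, 255))  # complement
    return frames

def decode_frames(frames):
    """Recover bits by differencing each complementary frame pair."""
    bits = []
    for f1, f2 in zip(frames[0::2], frames[1::2]):
        bits.append(int((f1.astype(float) - f2).mean() > 0))
    return bits

img = np.full((48, 64), 128.0)   # toy gray image
msg = [1, 0, 1, 1, 0]
assert decode_frames(embed_frames(img, msg)) == msg
```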
Yuan, Xinhao; Williams-King, David Christopher; Yang, Junfeng; Sethumadhavan, Simhahttp://dx.doi.org/10.7916/D88S4NRPFri, 30 Jan 2015 00:00:00 +0000Among all classes of parallel programming abstractions, lock-free data structures are considered one of the most scalable and efficient because of their fine-grained style of synchronization. However, they are also challenging for developers and tools to verify because of the huge number of possible interleavings that result from fine-grained synchronization. This paper addresses this fundamental tension between the performance and the verifiability of lock-free data structures. We present TXIT, a system that greatly reduces the set of possible interleavings by inserting transactions into the implementation of a lock-free data structure. We leverage hardware transactional memory support in Intel Haswell processors to enforce these artificial transactions. Evaluation on six popular lock-free data structures shows that TXIT makes it easy to verify lock-free data structures while incurring acceptable runtime overhead. Further analysis shows that two inefficiencies in Haswell are the largest contributors to this overhead.Computer sciencexy2189, dcw2131, jy2324, ss3418Computer ScienceTechnical reportsThe Internet is a Series of Tubeshttp://academiccommons.columbia.edu/catalog/ac:181980
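Python has no hardware transactional memory, so the sketch below stands in for Haswell's RTM with a lock; it only illustrates the TXIT idea that wrapping each whole operation in an artificial transaction collapses the interleaving space from every shared memory access down to whole-operation granularity, which is what makes verification tractable. All names are hypothetical:

```python
import threading

# Software stand-in for a hardware transaction (RTM XBEGIN..XEND):
# making each data-structure operation atomic means a verifier only
# has to consider interleavings of whole operations.
_txn = threading.Lock()

def artificial_transaction(fn):
    def wrapper(*args, **kwargs):
        with _txn:                # hypothetical stand-in for XBEGIN/XEND
            return fn(*args, **kwargs)
    return wrapper

class Stack:
    """Toy Treiber-style stack with artificial transactions inserted."""
    def __init__(self):
        self._top = None          # node = (value, next)

    @artificial_transaction
    def push(self, value):
        self._top = (value, self._top)

    @artificial_transaction
    def pop(self):
        if self._top is None:
            return None
        value, self._top = self._top
        return value
```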
Schulzrinne, Henning G.http://dx.doi.org/10.7916/D81C1VPTFri, 30 Jan 2015 00:00:00 +0000Internet adoption tends to progress in phases: early rapid adoption in urban areas and by relatively well-educated and higher-income households, then transitioning to a slower pace once between 70 and 85% of households subscribe. For example, in a 2013 survey, the Pew Research Internet Project found that approximately 15% of the adult population does not use the Internet; those least likely to use the Internet include “senior citizens, adults with less than a high-school education and those living in households earning less than $30,000 per year.”Computer science hgs10Computer ScienceTechnical reportsMetamorphic Runtime Checking of Applications Without Test Oracleshttp://academiccommons.columbia.edu/catalog/ac:181974
Bell, Jonathan Schaffer; Murphy, Christian; Kaiser, Gailhttp://dx.doi.org/10.7916/D8J9655PFri, 30 Jan 2015 00:00:00 +0000For some applications, it is impossible or impractical to know what the correct output should be for an arbitrary input, making testing difficult. Many machine-learning applications for “big data”, bioinformatics and cyberphysical systems fall in this scope: they do not have a test oracle. Metamorphic Testing, a simple testing technique that does not require a test oracle, has been shown to be effective for testing such applications. We present Metamorphic Runtime Checking, a novel approach that conducts metamorphic testing of both the entire application and individual functions during a program’s execution. We have applied Metamorphic Runtime Checking to 9 machine-learning applications, finding it to be on average 170% more effective than traditional metamorphic testing at only the full application level.Computer sciencejsb2125, gek1Computer ScienceTechnical reportsPhosphor: Illuminating Dynamic Data Flow in the JVM (Artifact for Evaluation)http://academiccommons.columbia.edu/catalog/ac:182689
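The core of metamorphic testing without an oracle can be shown concisely: instead of checking an exact expected output, check that a known input transformation produces the predicted output transformation. The property, names, and tolerance below are illustrative, not the paper's actual harness:

```python
import math, random

def check_metamorphic(fn, transform_in, transform_out, trials=100):
    """Runtime check: for random inputs x, fn(transform_in(x)) should
    equal transform_out(fn(x)), even though fn's exact correct output
    is unknown (no test oracle)."""
    for _ in range(trials):
        x = random.uniform(-10, 10)
        if not math.isclose(fn(transform_in(x)), transform_out(fn(x)),
                            abs_tol=1e-9):
            return False
    return True

# Example property: sin(pi - x) == sin(x); a buggy sine would likely fail.
assert check_metamorphic(math.sin, lambda x: math.pi - x, lambda y: y)
```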
Bell, Jonathan Schaffer; Kaiser, Gail E.http://dx.doi.org/10.7916/D84F1PH4Tue, 13 Jan 2015 00:00:00 +0000Dynamic taint analysis is a well-known information flow analysis problem with many possible applications. Taint tracking allows for analysis of application data flow by assigning labels to inputs, and then propagating those labels through data flow. Taint tracking systems traditionally compromise among performance, precision, accuracy, and portability. Performance can be critical, as these systems are typically intended to be deployed with software, and hence must have low overhead. To be deployed in security-conscious settings, taint tracking must also be accurate and precise. Dynamic taint tracking must be portable in order to be easily deployed and adopted for real world purposes, without requiring recompilation of the operating system or language interpreter, and without requiring access to application source code. We present Phosphor, a dynamic taint tracking system for the Java Virtual Machine (JVM) that simultaneously achieves our goals of performance, accuracy, precision, and portability. Moreover, to our knowledge, it is the first portable general purpose taint tracking system for the JVM. We evaluated Phosphor's performance on two commonly used JVM languages (Java and Scala), on two versions of two commonly used JVMs (Oracle's HotSpot and OpenJDK's IcedTea) and on Android's Dalvik Virtual Machine, finding its performance to be impressive: as low as 3% (53% on average), using the DaCapo macro benchmark suite. This artifact contains the code needed to reproduce the experiments detailed in our paper.Computer sciencejsb2125, gek1Computer ScienceComputer softwareStable Multithreading: A New Paradigm for Reliable and Secure Threadshttp://academiccommons.columbia.edu/catalog/ac:181469
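Phosphor itself instruments JVM bytecode; the toy Python wrapper below only illustrates the label-propagation idea (taint labels following data flow through operations), with all names hypothetical:

```python
class Tainted:
    """Toy value wrapper that carries a set of taint labels and
    propagates their union through arithmetic, mimicking the label
    propagation a JVM taint tracker performs on bytecode."""
    def __init__(self, value, labels=frozenset()):
        self.value, self.labels = value, frozenset(labels)

    def _combine(self, other, op):
        ov = other.value if isinstance(other, Tainted) else other
        ol = other.labels if isinstance(other, Tainted) else frozenset()
        return Tainted(op(self.value, ov), self.labels | ol)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)

secret = Tainted(42, {"user-input"})
derived = secret * 2 + 7
assert derived.labels == {"user-input"}   # label followed the data flow
```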
Cui, Heminghttp://dx.doi.org/10.7916/D83N225BWed, 07 Jan 2015 00:00:00 +0000Multithreaded programs have become pervasive and critical due to the rise of multi-core hardware and the accelerating computational demand. Unfortunately, despite decades of research and engineering effort, these programs remain notoriously difficult to get right, and they are plagued with harmful concurrency bugs that can cause wrong outputs, program crashes, security breaches, and so on. Our research reveals that a root cause of this difficulty is that multithreaded programs have too many possible thread interleavings (or schedules) at runtime. Even given only a single input, a program may run into a great number of schedules, depending on factors such as hardware timing and OS scheduling. Considering all inputs, the number of schedules is even greater. It is extremely challenging to understand, test, analyze, or verify this huge number of schedules for a multithreaded program and make sure that all these schedules are free of concurrency bugs. Thus, multithreaded programs are extremely difficult to get right. To reduce the number of possible schedules for all inputs, we looked into the relation between inputs and schedules of real-world programs, and made an exciting discovery: many programs need only a small set of schedules to efficiently process a wide range of inputs! Leveraging this discovery, we have proposed a new idea called Stable Multithreading (or StableMT) that reuses each schedule on a wide range of inputs, greatly reducing the number of possible schedules for all inputs. By addressing the root cause that makes multithreading difficult to get right, StableMT makes understanding, testing, analyzing, and verification of multithreaded programs much easier. To realize StableMT, we have built three StableMT systems, TERN, PEREGRINE, and PARROT, with each addressing a distinct research challenge. Evaluation on a wide range of 108 popular multithreaded programs with our latest StableMT system, PARROT, shows that StableMT is simple, fast, and deployable. All PARROT's source code, entire benchmarks, and raw evaluation results are available at http://github.com/columbia/smt-mc. To encourage deployment, we have applied StableMT to improve several reliability techniques, including: (1) making reproducing real-world concurrency bugs much easier; (2) greatly improving the precision of static program analysis, leading to the detection of several new harmful data races in heavily tested programs; and (3) greatly increasing the coverage of model checking, a systematic testing technique, by many orders of magnitude. StableMT has attracted the research community's interest, and some techniques and ideas in our StableMT systems have been leveraged by other researchers to compute a small set of schedules to cover all or most inputs for multithreaded programs.Computer sciencehc2428Computer ScienceDissertationsInformation Flow Auditing in the Cloudhttp://academiccommons.columbia.edu/catalog/ac:179733
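A minimal sketch of the StableMT idea follows: memoize one schedule for many inputs and enforce it at runtime. The signature-based lookup and the turn-taking enforcement below are illustrative assumptions, not the TERN/PEREGRINE/PARROT implementations:

```python
import threading

class ScheduleRegistry:
    """Many inputs map to one memoized schedule, so verification only
    has to cover the few schedules actually reused at runtime."""
    def __init__(self):
        self._schedules = {}

    def schedule_for(self, input_signature, default_order):
        # Reuse a schedule recorded for any input with the same signature.
        return self._schedules.setdefault(input_signature, default_order)

def run_with_schedule(tasks, order):
    """Enforce a total order over task execution with turn-taking."""
    turn = {"idx": 0}
    cv = threading.Condition()

    def runner(task_id, fn):
        with cv:
            cv.wait_for(lambda: order[turn["idx"]] == task_id)
            fn()
            turn["idx"] += 1
            cv.notify_all()

    threads = [threading.Thread(target=runner, args=(tid, fn))
               for tid, fn in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

registry = ScheduleRegistry()
order = registry.schedule_for(("nthreads", 3), default_order=[0, 1, 2])
out = []
run_with_schedule([(i, lambda i=i: out.append(i)) for i in range(3)], order)
assert out == [0, 1, 2]   # same schedule reused on every input like this
```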
Zavou, Angelikihttp://dx.doi.org/10.7916/D82B8WQ9Thu, 13 Nov 2014 00:00:00 +0000As cloud technology matures and trendsetters like Google, Amazon, Microsoft, Apple, and VMware have become the top-tier cloud services players, public cloud services have turned mainstream for individual users. In this work, I propose a set of techniques that can be used as the basis for alleviating cloud customers' privacy concerns and elevating their confidence in using the cloud for security-sensitive operations as well as trusting it with their sensitive data. The main goal is to provide cloud customers with a reliable mechanism that will cover the entire path of tracking their sensitive data, from collection and use by cloud-hosted services to the presentation of the tracking results to the respective data owners. In particular, my design accomplishes this goal by retrofitting legacy applications with data flow tracking techniques and providing the cloud customers with comprehensive information flow auditing capabilities. For this purpose, we created CloudFence, a cloud-wide fine-grained data flow tracking (DFT) framework that tracks sensitive data in well-defined domains, offering additional protection against inadvertent leaks and unauthorized access.Computer scienceaz2172Computer ScienceDissertationsDetecting Inappropriate Clarification Requests in Spoken Dialogue Systemshttp://academiccommons.columbia.edu/catalog/ac:179539
Liu, Alex; Sloan, Rose; Then, Mei-Vern; Stoyanchev, Svetlana; Hirschberg, Julia; Shriberg, Elizabethhttp://dx.doi.org/10.7916/D8833QPBTue, 11 Nov 2014 00:00:00 +0000Spoken Dialogue Systems ask for clarification when they think they have misunderstood users. Such requests may differ depending on the information the system believes it needs to clarify. However, when the error type or location is misidentified, clarification requests appear confusing or inappropriate. We describe a classifier that identifies inappropriate requests, trained on features extracted from user responses in laboratory studies. This classifier achieves 88.5% accuracy and .885 F-measure in detecting such requests.Technical communicational3037Computer SciencePresentationsDefending against Return-Oriented Programminghttp://academiccommons.columbia.edu/catalog/ac:179721
Pappas, Vasileioshttp://dx.doi.org/10.7916/D8CZ35VHMon, 10 Nov 2014 00:00:00 +0000Return-oriented programming (ROP) has become the primary exploitation technique for system compromise in the presence of non-executable page protections. ROP exploits are facilitated mainly by the lack of complete address space randomization coverage or the presence of memory disclosure vulnerabilities, necessitating additional ROP-specific mitigations. Existing defenses against ROP exploits either require source code or symbolic debugging information, or impose a significant runtime overhead, which limits their applicability for the protection of third-party applications. We propose two novel techniques to prevent ROP exploits on third-party applications without requiring their source code or debug symbols, while at the same time incurring a minimal performance overhead. Their effectiveness is based on breaking an invariant of ROP attacks: knowledge of the code layout, and a common characteristic: unrestricted use of indirect branches. When combined, they still retain their applicability and efficiency, while maximizing the protection coverage against ROP. The first technique, in-place code randomization, uses narrow-scope code transformations that can be applied statically, without changing the location of basic blocks, allowing the safe randomization of stripped binaries even with partial disassembly coverage. These transformations effectively eliminate 10%, and probabilistically break 80% of the useful instruction sequences found in a large set of PE files. Since no additional code is inserted, in-place code randomization does not incur any measurable runtime overhead, enabling it to be easily used in tandem with existing exploit mitigations such as address space layout randomization. Our evaluation using publicly available ROP exploits and two ROP code generation toolkits demonstrates that our technique prevents the exploitation of the tested vulnerable Windows 7 applications, including Adobe Reader, as well as the automated construction of alternative ROP payloads that aim to circumvent in-place code randomization using solely any remaining unaffected instruction sequences. The second technique is based on the detection of abnormal control transfers that take place during ROP code execution. This is achieved using hardware features of commodity processors, which incur negligible runtime overhead and allow for completely transparent operation without requiring any modifications to the protected applications. Our implementation for Windows 7, named kBouncer, can be selectively enabled for installed programs in the same fashion as user-friendly mitigation toolkits like Microsoft's EMET. The results of our evaluation demonstrate that kBouncer has low runtime overhead of up to 4%, when stressed with specially crafted workloads that continuously trigger its core detection component, while it has negligible overhead for actual user applications. In our experiments with in-the-wild ROP exploits, kBouncer successfully protected all tested applications, including Internet Explorer, Adobe Flash Player, and Adobe Reader. In addition, we introduce a technique that enables ASLR for executables with stripped relocation information by incrementally adjusting stale absolute addresses at runtime. The technique relies on runtime monitoring of memory accesses and control flow transfers to the original location of a module using page table manipulation. 
We have implemented a prototype of the proposed technique for Windows 8, which is transparently applicable to third-party stripped binaries. Our results demonstrate that incremental runtime relocation patching is practical, incurs a runtime overhead of up to 83% in most cases for initial runs of protected programs, and has a low runtime overhead of 5% on subsequent runs.Computer sciencevp2214Computer ScienceDissertationsSocietal Computinghttp://academiccommons.columbia.edu/catalog/ac:179089
Sheth, Swapneelhttp://dx.doi.org/10.7916/D86T0K8SWed, 29 Oct 2014 00:00:00 +0000As Social Computing has increasingly captivated the general public, it has become a popular research area for computer scientists. Social Computing research focuses on online social behavior and using artifacts derived from it for providing recommendations and other useful community knowledge. Unfortunately, some of that behavior and knowledge incur societal costs, particularly with regards to Privacy, which is viewed quite differently by different populations as well as regulated differently in different locales. But clever technical solutions to those challenges may impose additional societal costs, e.g., by consuming substantial resources at odds with Green Computing, another major area of societal concern. We propose a new crosscutting research area, Societal Computing, that focuses on the technical tradeoffs among computational models and application domains that raise significant societal issues. We highlight some of the relevant research topics and open problems that we foresee in Societal Computing. We feel that these topics, and Societal Computing in general, need to gain prominence as they will provide useful avenues of research leading to increasing benefits for society as a whole. This thesis will consist of the following four projects that aim to address the issues of Societal Computing. First, privacy in the context of ubiquitous social computing systems has become a major concern for society at large. As the number of online social computing systems that collect user data grows, concerns with privacy are further exacerbated. Examples of such online systems include social networks, recommender systems, and so on. Approaches to addressing these privacy concerns typically require substantial extra computational resources, which might be beneficial where privacy is concerned, but may have significant negative impact with respect to Green Computing and sustainability, another major societal concern. Spending more computation time results in spending more energy and other resources that make the software system less sustainable. Ideally, what we would like are techniques for designing software systems that address these privacy concerns but which are also sustainable — systems where privacy could be achieved “for free,” i.e., without having to spend extra computational effort. We describe how privacy can indeed be achieved for free — an accidental and beneficial side effect of doing some existing computation — in web applications and online systems that have access to user data. We show the feasibility, sustainability, and utility of our approach and what types of privacy threats it can mitigate. Second, we aim to understand what the expectations and needs of end-users and software developers are with respect to privacy in social systems. Some questions that we want to answer are: Do end-users care about privacy? What aspects of privacy are the most important to end-users? Do we need different privacy mechanisms for technical vs. non-technical users? Should we customize privacy settings and systems based on the geographic location of the users? We have created a large-scale user study using an online questionnaire to gather privacy requirements from a variety of stakeholders. We also plan to conduct follow-up semi-structured interviews. This user study will help us answer these questions. 
Third, a challenge related to the above is to make privacy more understandable in complex systems that may have a variety of user interface options, which may change often. Our approach is to use crowdsourcing to find out how other users deal with privacy and what settings are commonly used, to give users feedback on aspects like how public/private their settings are, what settings are typically used by others, and where a given user's settings differ from those of a trusted group of friends. We have a large dataset of privacy settings for over 500 users on Facebook and we plan to create a user study that will use the data to make privacy settings more understandable. Finally, end-users of such systems find it increasingly hard to understand complex privacy settings. As software evolves over time, this might introduce bugs that breach users’ privacy. Further, there might be system-wide policy changes that could change users’ settings to be more or less private than before. We present a novel technique that can be used by end-users for detecting changes in privacy, i.e., regression testing for privacy. Using a social approach for detecting privacy bugs, we present two prototype tools. Our evaluation shows the feasibility and utility of our approach for detecting privacy bugs. We highlight two interesting case studies on the bugs that were discovered using our tools. To the best of our knowledge, this is the first technique that leverages regression testing for detecting privacy bugs from an end-user perspective.Computer scienceComputer ScienceTechnical reportsFailure Analysis of the New York City Power Gridhttp://academiccommons.columbia.edu/catalog/ac:179086
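The "regression testing for privacy" idea from the last project can be pictured as a snapshot diff: capture a user's settings before and after a software change and report any visibility drift. The field names and function below are hypothetical, not the prototype tools' actual interfaces:

```python
def diff_privacy(before, after):
    """Report settings whose visibility changed between two snapshots,
    e.g. across a site upgrade or a system-wide policy change."""
    changes = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes

v1 = {"photos": "friends", "email": "only-me", "posts": "friends"}
v2 = {"photos": "public",  "email": "only-me", "posts": "friends"}
# The photos setting silently became more public: a privacy regression.
assert diff_privacy(v1, v2) == {"photos": ("friends", "public")}
```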
Wu, Leon; Anderson, Roger N.; Boulanger, Albert G.; Rudin, Cynthia; Kaiser, Gail E.http://dx.doi.org/10.7916/D8G73CBZWed, 29 Oct 2014 00:00:00 +0000As the U.S. power grid transforms itself into a Smart Grid, it has become less reliable in recent years. Power grid failures lead to huge financial costs and affect people’s lives. Using a statistical analysis and holistic approach, this paper analyzes New York City power grid failures: failure patterns and climatic effects. Our findings include: higher peak electrical load increases the likelihood of power grid failure; a feeder failure increases the likelihood of subsequent failures among electrical feeders sharing the same substation; underground feeders fail less often than overhead feeders; cables and joints installed during certain years are more likely to fail; and higher temperatures lead to more power grid failures. We further suggest preventive maintenance, intertemporal consumption, and electrical load optimization for failure prevention. We also estimate that the predictability of power grid component failures correlates with the cycles of the North Atlantic Oscillation (NAO) Index.Computer sciencellw2107, rna1, agb6, gek1Computer Science, Center for Computational Learning SystemsTechnical reportsPhasor Imaging: A Generalization of Correlation-Based Time-of-Flight Imaginghttp://academiccommons.columbia.edu/catalog/ac:178997
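The kind of statistical failure analysis this abstract describes can be illustrated with a logistic model over synthetic feeder records; the covariates, coefficients, and data below are illustrative assumptions reflecting the reported findings (higher load and temperature raise failure likelihood), not the paper's dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for feeder records.
rng = np.random.default_rng(0)
load = rng.uniform(0.4, 1.0, 1000)      # peak load, fraction of rating
temp = rng.uniform(10, 40, 1000)        # temperature, degrees Celsius
logit = 6 * load + 0.15 * temp - 8      # assumed ground-truth effect
fail = rng.random(1000) < 1 / (1 + np.exp(-logit))

X = np.column_stack([load, temp])
model = LogisticRegression().fit(X, fail)
print(model.coef_)  # both coefficients come out positive, as expected
```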
Gupta, Mohit; Nayar, Shree K.; Hullin, Matthias B.; Martin, Jaimehttp://dx.doi.org/10.7916/D8P26WRTMon, 27 Oct 2014 00:00:00 +0000In correlation-based time-of-flight (C-ToF) imaging systems, light sources with temporally varying intensities illuminate the scene. Due to global illumination, the temporally varying radiance received at the sensor is a combination of light received along multiple paths. Recovering scene properties (e.g., scene depths) from the received radiance requires separating these contributions, which is challenging due to the complexity of global illumination and the additional temporal dimension of the radiance. We propose phasor imaging, a framework for performing fast inverse light transport analysis using C-ToF sensors. Phasor imaging is based on the idea that by representing light transport quantities as phasors and light transport events as phasor transformations, light transport analysis can be simplified in the temporal frequency domain. We study the effect of temporal illumination frequencies on light transport, and show that for a broad range of scenes, global radiance (multi-path interference) vanishes for frequencies higher than a scene-dependent threshold. We use this observation for developing two novel scene recovery techniques. First, we present Micro ToF imaging, a ToF-based shape recovery technique that is robust to errors due to global illumination. Second, we present a technique for separating the direct and global components of radiance. Both techniques require capturing as few as 3-4 images and minimal computations. We demonstrate the validity of the presented techniques via simulations and experiments performed with our hardware prototype.Computer sciencemg3156, skn3Computer ScienceTechnical reportsRepeatable Reverse Engineering for the Greater Good with PANDAhttp://academiccommons.columbia.edu/catalog/ac:179006
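For reference, the standard four-bucket C-ToF depth estimator, written in the phasor (complex-number) form this paper generalizes, looks like the sketch below. The direct-only assumption corresponds to operating above the scene's global-illumination frequency threshold; this is a sketch for orientation, not the paper's pipeline:

```python
import numpy as np

C = 3e8  # speed of light, m/s

def depth_from_phase(i_samples, freq):
    """Recover depth from four correlation samples taken at phase
    offsets 0, 90, 180, 270 degrees, assuming direct-only radiance."""
    i0, i90, i180, i270 = i_samples
    phasor = (i0 - i180) + 1j * (i270 - i90)   # complex brightness
    phase = np.angle(phasor) % (2 * np.pi)
    return C * phase / (4 * np.pi * freq)      # halve the round trip

# A target at 2.5 m and f = 30 MHz produces these ideal samples:
f, d = 30e6, 2.5
phi = 4 * np.pi * f * d / C
samples = [1 + np.cos(phi + k * np.pi / 2) for k in range(4)]
print(depth_from_phase(samples, f))  # ~2.5
```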
Dolan-Gavitt, Brendan F.; Hodosh, Josh; Hulin, Patrick; Leek, Tim; Whelan, Ryanhttp://dx.doi.org/10.7916/D8WM1C1PMon, 27 Oct 2014 00:00:00 +0000We present PANDA, an open-source tool that has been purpose-built to support whole system reverse engineering. It is built upon the QEMU whole system emulator, and so analyses have access to all code executing in the guest and all data. PANDA adds the ability to record and replay executions, enabling iterative, deep, whole system analyses. Further, the replay log files are compact and shareable, allowing for repeatable experiments. A nine billion instruction boot of FreeBSD, e.g., is represented by only a few hundred MB. Further, PANDA leverages QEMU's support of thirteen different CPU architectures to make analyses of those diverse instruction sets possible within the LLVM IR. In this way, PANDA can have a single dynamic taint analysis, for example, that precisely supports many CPUs. PANDA analyses are written in a simple plugin architecture which includes a mechanism to share functionality between plugins, increasing analysis code re-use and simplifying complex analysis development. We demonstrate PANDA's effectiveness via a number of use cases, including enabling an old but legitimate version of Starcraft to run despite a lost CD key, in-depth diagnosis of an Internet Explorer crash, and uncovering the censorship activities and mechanisms of a Chinese IM client.Computer sciencebd2433Computer ScienceTechnical reportsHigh Availability for Carrier-Grade SIP Infrastructure on Cloud Platformshttp://academiccommons.columbia.edu/catalog/ac:179012
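PANDA's plugins are written in C/C++ against QEMU; the Python sketch below only illustrates the callback-over-a-replayed-log plugin pattern the abstract describes, with illustrative names rather than PANDA's actual API:

```python
class Analysis:
    """Toy plugin host: plugins register callbacks on execution events,
    and replaying the same recorded log makes every analysis repeatable."""
    def __init__(self):
        self._callbacks = {"before_insn": [], "after_insn": []}
        self.services = {}            # functionality shared across plugins

    def register(self, event, fn):
        self._callbacks[event].append(fn)

    def replay(self, instruction_log):
        for insn in instruction_log:  # same log -> identical analysis runs
            for fn in self._callbacks["before_insn"]:
                fn(insn)
            for fn in self._callbacks["after_insn"]:
                fn(insn)

analysis = Analysis()
counts = {}
analysis.register("before_insn",
                  lambda insn: counts.update({insn: counts.get(insn, 0) + 1}))
analysis.replay(["mov", "add", "mov"])
assert counts == {"mov": 2, "add": 1}
```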
Kim, Jong Yul; Schulzrinne, Henning G.http://dx.doi.org/10.7916/D8N29VK0Mon, 27 Oct 2014 00:00:00 +0000SIP infrastructure on cloud platforms has the potential to be both scalable and highly available. In our previous project, we focused on the scalability aspect of SIP services on cloud platforms; the focus of this project is on the high availability aspect. We investigated the effects of component fault on service availability with the goal of understanding how high availability can be guaranteed even in the face of component faults. The experiments were conducted empirically on a real system that runs on Amazon EC2. Our analysis shows that most component faults are masked with a simple automatic failover technique. However, we have also identified fundamental problems that cannot be addressed by simple failover techniques; a problem involving DNS cache in resolvers and a problem involving static failover configurations. Recommendations on how to solve these problems are included in the report.Computer sciencejk2520, hgs10Computer ScienceTechnical reportsKamino: Dynamic Approach to Semantic Code Clone Detectionhttp://academiccommons.columbia.edu/catalog/ac:179003
Neubauer, Lindsay Annehttp://dx.doi.org/10.7916/D8542M79Mon, 27 Oct 2014 00:00:00 +0000Discovering code clones in a runtime environment helps software engineers identify hard-to-find logic-based bugs. Yet most research in the area of code clone discovery deals with source code due to the complexity of finding clones in a dynamic environment. KAMINO manipulates Java bytecode to track control and data flow dependencies at the method level of Java programs at runtime. It then matches similar flows to find semantic code clones. With positive preliminary results from KAMINO indicating code clones, future tests will compare its robustness to that of existing code clone detection tools.Computer sciencelan2135Computer ScienceTechnical reportsDetecting, Isolating and Enforcing Dependencies Between and Within Test Caseshttp://academiccommons.columbia.edu/catalog/ac:179000
Bell, Jonathan Schafferhttp://dx.doi.org/10.7916/D8DN43P1Mon, 27 Oct 2014 00:00:00 +0000Testing stateful applications is challenging, as it can be difficult to identify hidden dependencies on program state. These dependencies may manifest between several test cases, or simply within a single test case. When it's left to developers to document, understand, and respond to these dependencies, a mistake can result in unexpected and invalid test results. Although current testing infrastructure does not leverage state dependency information, we argue that it could, and that by doing so testing can be improved. Our results thus far show that by recovering dependencies between test cases and modifying the popular testing framework, JUnit, to utilize this information, we can optimize the testing process, reducing the time needed to run tests by 62% on average. Our ongoing work is to apply similar analyses to improve existing state-of-the-art test suite prioritization techniques and state-of-the-art test case generation techniques. This work is advised by Professor Gail Kaiser.Computer sciencejsb2125Computer ScienceTechnical reportsHybrid System Combination for Machine Translation: An Integration of Phrase-level and Sentences-level Combination Approacheshttp://academiccommons.columbia.edu/catalog/ac:178881
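One way to picture the dependency recovery this abstract describes: record which shared state each test reads and writes, then derive writer-before-reader constraints that a test runner can respect. The access log and names below are hypothetical:

```python
def recover_dependencies(accesses):
    """accesses: test -> (reads, writes). A test depends on every other
    test that writes state it reads."""
    deps = {t: set() for t in accesses}
    for t, (reads, _) in accesses.items():
        for u, (_, writes) in accesses.items():
            if u != t and reads & writes:
                deps[t].add(u)
    return deps

accesses = {
    "test_login":  (set(),       {"session"}),
    "test_logout": ({"session"}, set()),
    "test_search": (set(),       set()),   # independent: safe to reorder
}
deps = recover_dependencies(accesses)
assert deps["test_logout"] == {"test_login"}
assert deps["test_search"] == set()
```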
Ma, Wei-Yunhttp://dx.doi.org/10.7916/D8JS9P3ZWed, 15 Oct 2014 00:00:00 +0000Given the wide range of successful statistical MT approaches that have emerged recently, it would be beneficial to take advantage of their individual strengths and avoid their individual weaknesses. Multi-Engine Machine Translation (MEMT) attempts to do so by either fusing the output of multiple translation engines or selecting the best translation among them, aiming to improve the overall translation quality. In this thesis, we propose to use the phrase or the sentence as our combination unit instead of the word; three new phrase-level models and one sentence-level model with novel features are proposed. This contrasts with the most popular system combination technique to date, which relies on word-level confusion network decoding. Among the three new phrase-level models, the first one utilizes source sentences and target translation hypotheses to learn hierarchical phrases -- phrases that contain subphrases (Chiang 2007). It then re-decodes the source sentences using the hierarchical phrases to combine the results of multiple MT systems. The other two models we propose view combination as a paraphrasing process and use paraphrasing rules. The paraphrasing rules are composed of either string-to-string paraphrases or hierarchical paraphrases, learned from monolingual word alignments between a selected best translation hypothesis and other hypotheses. Our experimental results show that all three phrase-level models give superior performance in BLEU compared with the best single translation engine. The two paraphrasing models outperform the re-decoding model and the confusion network baseline model. The sentence-level model exploits more complex syntactic and semantic information than the phrase-level models. It uses consensus, argument alignment, a supertag-based structural language model and a syntactic error detector. We use our sentence-level model in two ways: the first selects a translated sentence from multiple MT systems as the best translation to serve as a backbone for the paraphrasing process; the second makes the final decision among all fused translations generated by the phrase-level models and all translated sentences of multiple MT systems. We propose two novel hybrid combination structures for the integration of phrase-level and sentence-level combination frameworks in order to utilize the advantages of both frameworks and provide a more diverse set of plausible fused translations to consider.Computer sciencewm2174Computer ScienceDissertationsA Fractional Programming Framework for Support Vector Machine-type Formulationshttp://academiccommons.columbia.edu/catalog/ac:178492
Vovsha, Iliahttp://dx.doi.org/10.7916/D8M61HVGMon, 13 Oct 2014 00:00:00 +0000We develop a theoretical framework for relating various formulations of regularization problems through fractional programming. We focus on problems with objective functions of the type L + λ · P , where the parameter λ lacks intuitive interpretation. We observe that fractional programming is an elegant approach to obtain bounds on the range of the parameter, and then generalize this approach to show that different forms can be obtained from a common fractional program. Furthermore, we apply the proposed framework in two concrete settings; we consider support vector machines (SVMs), where the framework clarifies the relation between various existing soft-margin dual forms for classification, and the SVM+ algorithm (Vapnik and Vashist, 2009), where we use this methodology to derive a new dual formulation, and obtain bounds on the cost parameter.Computer scienceiv2121Computer Science, Center for Computational Learning SystemsTechnical reportsMethods for Inference in Graphical Modelshttp://academiccommons.columbia.edu/catalog/ac:178874
Weller, Adrianhttp://dx.doi.org/10.7916/D8JD4VDCMon, 13 Oct 2014 00:00:00 +0000Graphical models provide a flexible, powerful and compact way to model relationships between random variables, and have been applied with great success in many domains. Combining prior beliefs with observed evidence to form a prediction is called inference. Problems of great interest include finding a configuration with highest probability (MAP inference) or solving for the distribution over a subset of variables (marginal inference). Further, these methods are often critical subroutines for learning the relationships. However, inference is computationally intractable in general. Hence, much effort has focused on two themes: finding subdomains where exact inference is solvable efficiently, or identifying approximate methods that work well. We explore both these themes, restricting attention to undirected graphical models with discrete variables. First we address exact MAP inference by advancing the recent method of reducing the problem to finding a maximum weight stable set (MWSS) on a derived graph, which, if perfect, admits polynomial time inference. We derive new results for this approach, including a general decomposition theorem for models of any order and number of labels, extensions of results for binary pairwise models with submodular cost functions to higher order, and a characterization of which binary pairwise models can be efficiently solved with this method. This clarifies the power of the approach on this class of models, improves our toolbox and provides insight into the range of tractable models. Next we consider methods of approximate inference, with particular emphasis on the Bethe approximation, which is in widespread use and has proved remarkably effective, yet is still far from being completely understood. We derive new formulations and properties of the derivatives of the Bethe free energy, then use these to establish an algorithm to compute log of the optimum Bethe partition function to arbitrary epsilon-accuracy. Further, if the model is attractive, we demonstrate a fully polynomial-time approximation scheme (FPTAS), which is an important theoretical result, and demonstrate its practical applications. Next we explore ways to tease apart the two aspects of the Bethe approximation, i.e. the polytope relaxation and the entropy approximation. We derive analytic results, show how optimization may be explored over various polytopes in practice, even for large models, and remark on the observed performance compared to the true distribution and the tree-reweighted (TRW) approximation. This reveals important novel observations and helps guide inference in practice. Finally, we present results related to clamping a selection of variables in a model. We derive novel lower bounds on an array of approximate partition functions based only on the model's topology. Further, we show that in an attractive binary pairwise model, clamping any variable and summing over the approximate sub-partition functions can only increase (hence improve) the Bethe approximation, then use this to provide a new, short proof that the Bethe partition function lower bounds the true value for this class of models. The bulk of this work focuses on the class of binary, pairwise models, but several results apply more generally.Computer scienceComputer ScienceDissertationsSound and Precise Analysis of Multithreaded Programs through Schedule Specialization and Execution Filtershttp://academiccommons.columbia.edu/catalog/ac:178219
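For reference, the Bethe approximation discussed in the abstract above replaces the true entropy of a pairwise model with edge entropies minus overcounted node entropies. This is the standard formulation, stated here for orientation rather than taken from the thesis:

```latex
% Bethe free energy for a pairwise model with beliefs b_i, b_{ij},
% node degrees d_i, and energies E; log Z_B = -\min_b F_B(b).
\begin{align}
F_B(b) &= \sum_{(i,j)\in E} \sum_{x_i,x_j} b_{ij}(x_i,x_j)\, E_{ij}(x_i,x_j)
        + \sum_i \sum_{x_i} b_i(x_i)\, E_i(x_i) - H_B(b), \\
H_B(b) &= \sum_{(i,j)\in E} H(b_{ij}) - \sum_i (d_i - 1)\, H(b_i).
\end{align}
```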
Wu, Jingyuehttp://dx.doi.org/10.7916/D8BZ64N7Tue, 07 Oct 2014 00:00:00 +0000Multithreaded programs are known to be difficult to analyze. A key reason is that they typically have an enormous number of execution interleavings, or schedules. Static analysis with respect to all schedules requires over-approximation, resulting in poor precision; dynamic analysis rarely covers more than a tiny fraction of all schedules, so its result may not hold for schedules not covered. To address this challenge, we propose a novel approach called schedule specialization that restricts the schedules of a program to make it easier to analyze. Schedule specialization combines static and dynamic analysis. It first statically analyzes a multithreaded program with respect to a small set of schedules for precision, and then enforces these schedules at runtime for soundness of the static analysis results. To demonstrate that this approach works, we build three systems. The first system is a specialization framework that specializes a program into a simpler program based on a schedule for precision. It allows stock analyses to automatically gain precision with only minor modification. The second system is Peregrine, a deterministic multithreading system that collects and enforces schedules on future inputs. Peregrine reuses a small set of schedules on many inputs, ensuring that our static analysis results are sound for a wide range of inputs. It also enforces these schedules efficiently, making schedule specialization suitable for production usage. Although schedule specialization can make static concurrency error detection more precise, some concurrency errors such as races may still slip detection and enter production systems. To mitigate this limitation, we build Loom, a live-workaround system that protects a live multithreaded program from races that slip detection. It allows developers to easily write execution filters to safely and efficiently work around deployed races in live multithreaded programs without restarting them.Computer sciencejw2671Computer ScienceDissertationsA Novel Quantification Method for Determining Previously Undetected Silent Infarcts on MR-perfusion in Patients Following Carotid Endarterectomyhttp://academiccommons.columbia.edu/catalog/ac:177675
Liu, Xin; Imielinska, Celina Z.; Rosiene, Joel; Rampersad, Anita; Wilson, David; Zurica, Joseph; Halazun, Hadi; Williams, Susan C.; Ligneli, Angela; D'Ambrosio, Anthony; Sughrue, Michael; Connolly Jr., E. Sander; Heyer, Eric J.http://dx.doi.org/10.7916/D89G5KB2Mon, 29 Sep 2014 00:00:00 +0000The purpose of this paper is to evaluate the post-operative Magnetic Resonance Perfusion (MRP) scans of patients undergoing carotid endarterectomy (CEA), using a novel image-analysis algorithm, to determine if post-operative neurocognitive decline is associated with cerebral blood flow changes. The CEA procedure reduces the risk of stroke in appropriately selected patients with significant carotid artery stenosis. However, 25% of patients experience subtle cognitive deficits after CEA compared to a control group. It was hypothesized that abnormalities in cerebral blood flow (CBF) are responsible for these cognitive deficits. A novel algorithm for analyzing MR-perfusion (MRP) scans was developed to identify and quantify CBF asymmetry between the hemispheres, expressing the degree of relative difference between three corresponding vascular regions in the ipsilateral and contralateral hemispheres as a Relative Difference Map (RDM). Patients undergoing CEA and spine surgery (controls) were examined preoperatively and one day postoperatively with a battery of neuropsychometric (NPM) tests, and were labeled “injured” if they showed significant cognitive deficits and “normal” if they demonstrated no decline in neurocognitive function. MRP scans of patients with cognitive deficits show apparently significant RDM differences between the two hemispheres, which can be used to guide expert review of the imagery. The proposed methodology aids in the analysis of MRP parameters in patients with cognitive impairment.Bioinformatics, Medical imaging and radiology, Neurosciencesci42, esc2181, ejh3Neurological Surgery, Biomedical Informatics, Computer Science, Anesthesiology, RadiologyConferencesEvaluation of Ischemic Stroke Hybrid Segmentation in a Rat Model of Temporary Middle Cerebral Artery Occlusion using Ground Truth from Histologic and MR datahttp://academiccommons.columbia.edu/catalog/ac:177672
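The abstract does not give the RDM formula, so the sketch below uses a common symmetric relative-difference measure as a stand-in; treat it as a hypothetical reconstruction of the per-territory asymmetry computation:

```python
import numpy as np

def relative_difference_map(left, right, eps=1e-6):
    """Per-region relative difference of CBF between hemispheres,
    symmetric in sign and bounded in [-2, 2]. The paper's exact
    formula may differ; this is an illustrative stand-in."""
    left, right = np.asarray(left, float), np.asarray(right, float)
    return 2 * (left - right) / (left + right + eps)

# Mean CBF in three corresponding vascular territories (ml/100g/min):
ipsi, contra = [52.0, 38.0, 45.0], [50.0, 49.0, 46.0]
print(relative_difference_map(ipsi, contra))
# A strongly negative or positive entry flags an asymmetric territory.
```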
Imielinska, Celina Z.; Jin, Yinpeng; Liu, Xin; Rosiene, Joel; Zacharia, Brad E.; Komotar, Ricardo J.; Mocco, J.; Sughrue, Michael E.; Grobelny, Bartosz; Sisti, Alex; Silverberg, Josh; Khandji, Joyce; Cohen, Hillary; Connolly Jr., E. Sander; D'Ambrosio, Anthony L.http://dx.doi.org/10.7916/D8K072TJMon, 29 Sep 2014 00:00:00 +0000A segmentation method that quantifies cerebral infarct using rat data with ischemic stroke is evaluated using ground truth from histologic and MR data. To demonstrate an alternative to rapid quantification of cerebral infarct volumes from histologically stained slices, which requires sacrificing the animal, a study with MR-acquired volumetric rat data is proposed in which ground truth is obtained by manual delineations by experts and automated segmentation is assessed for accuracy. A framework for evaluation of segmentation is used that provides more detailed accuracy measurements than mere cerebral infarct volume. Our preliminary experiment shows that ground truth derived from MRI data is at least as good as that obtained from the histologic slices for evaluating segmentation algorithms for accuracy. Therefore we can develop and evaluate automated segmentation methods for rapid quantification of stroke without necessitating animal sacrifice.Bioinformatics, Neurosciences, Medical imaging and radiologyci42, esc2181Neurological Surgery, Biomedical Informatics, Computer Science, Biomedical EngineeringConferencesSemantic Relations in a Medical Digital Libraryhttp://academiccommons.columbia.edu/catalog/ac:177690
Wacholder, Nina; Imielinska, Celina Z.; Soliz, Ewa; Klavans, Judith; Molholt, Pathttp://dx.doi.org/10.7916/D8J101Q5Mon, 29 Sep 2014 00:00:00 +0000In this paper, we describe the Vesalius™ Project, a multi-modal collection of anatomical resources under development at Columbia University. Our focus is on the need for navigational tools to effectively access the wealth of electronic information on anatomy, including life-like 3D images of anatomical entities that can be interactively viewed and browsed. We describe a key component which must be in place in order to develop a flexible and reusable digital library system, namely an anatomical knowledge base containing a 'nucleus' of anatomical information specifically designed to make it possible to develop a wide spectrum of curriculum applications that use and extend the information in the knowledge base. The unique contribution of our research lies in the dual focus on user needs and on effective use of knowledge representation theory in order to develop a system that makes it possible to take advantage of interactive 3D models and the wealth of other anatomical data now available.Library science, Technical communication, Health sciencesci42, jlk18Neurobiology and Behavior, Biomedical Informatics, Computer Science, Center for Research on Information AccessConferencesSemi-automated Color Segmentation of Anatomical Tissuehttp://academiccommons.columbia.edu/catalog/ac:177693
Imielinska, Celina Z.; Downes, M. S.; Yuan, W.http://dx.doi.org/10.7916/D88G8J8KMon, 29 Sep 2014 00:00:00 +0000We propose a semi-automated region-based color segmentation algorithm to extract anatomical structures, including soft tissues, in the color anatomy slices of the Visible Human data. Our approach is based on repeatedly dividing an image into regions using Voronoi diagrams and classifying the regions based on experimental classification statistics. The user has the option of reclassifying regions in order to improve the final boundary. Our results indicate that the algorithm can find accurate outlines in a small number of iterations and that manual interaction can markedly improve the outline. This approach can be extended to 3D color segmentation.Medical imaging and radiology, Bioinformatics, Computer scienceci42Biomedical Informatics, Computer ScienceArticlesConventional Orthography for Dialectal Arabic (CODA): Principles and Guidelines -- Egyptian Arabic - Version 0.7 - March 2012http://academiccommons.columbia.edu/catalog/ac:177659
Habash, Nizar Y.; Diab, Mona T.; Rambow, Owen C.http://dx.doi.org/10.7916/D83X8562Fri, 26 Sep 2014 00:00:00 +0000This document introduces CODA (Conventional Orthography for Dialectal Arabic) and presents specifications and detailed guidelines for Egyptian Arabic CODA. CODA addresses the problem of inconsistent orthographic choices in raw (naturally occurring) written dialectal Arabic text. The specifications are a succinct summary, while the guidelines contain details and examples. The document has three parts, ordered from the most general to the most specific. In Part 1, we define CODA and present its general goals, principles and considerations in a non-dialect-specific manner. In Part 2, we present a high-level CODA specification for Egyptian Arabic (EGY). And in Part 3, we present detailed guidelines for EGY CODA.Computer sciencenh2142, md2370, ocr2101Computer Science, Center for Computational Learning SystemsTechnical reportsDevelopment of a Training Tool for Endotracheal Intubation: Distributed Augmented Realityhttp://academiccommons.columbia.edu/catalog/ac:177600
Rolland, Jannick; Davis, Larry; Hamza-Lup, Felix; Daly, Jason; Ha, Yonggang; Martin, Glenn; Norfleet, Jack; Thumann, Richard; Imielinska, Celina Z.http://dx.doi.org/10.7916/D8CJ8C10Wed, 24 Sep 2014 00:00:00 +0000The authors introduce a tool referred to as the Ultimate Intubation Head (UIH) to train medical practitioners’ hand-eye coordination in performing endotracheal intubation with the help of augmented reality methods. In this paper we describe the integration of a deployable UIH and present methods for augmented reality registration of real and virtual anatomical models. Assessment of the 52-degree field-of-view optics of the custom-designed and custom-built head-mounted display shows less than 1.5 arc minutes of blur and astigmatism, the two limiting optical aberrations. Distortion is less than 2.5%. Preliminary registration of a physical phantom mandible on its virtual counterpart yields less than 3 mm RMS registration error. Finally we describe an approach to distributed visualization where a given training procedure may be visualized and shared at various remote locations. Basic assessments of delays within two scenarios of data distribution were conducted and reported.Medicine, Computer science, Educational technologyci42Biomedical Informatics, Computer ScienceConferencesGround Truth for Evaluation of Ischemic Stroke Hybrid Segmentation in a Rat Model of Temporary Middle Cerebral Artery Occlusionhttp://academiccommons.columbia.edu/catalog/ac:177627
Imielinska, Celina Z.; Rosiene, J.; Jin, Y.; Liu, Xin; Udupa, J.; Zacharia, B.; Komotar, R.; Mocco, J.; Sughrue, M.; Grobelny, B.; Sisti, A.; Silverberg, J.; Khandji, J.; Cohen, H.; Connolly Jr., E. Sander; D'Ambrosio, Anthonyhttp://dx.doi.org/10.7916/D89P306JWed, 24 Sep 2014 00:00:00 +0000In vivo rodent models of focal cerebral ischemia have been developed to investigate stroke therapy. Typically these models require rapid quantification of cerebral infarct volumes using vital stains with tetrazolium salts to delineate the extent of neuronal death. To avoid animal sacrifice, we propose a study with MR-acquired volumetric rat data in which a surrogate of ground truth is obtained by repeated manual delineation by experts, and an automated hybrid segmentation is evaluated for accuracy. We propose a rating system for the expert delineations that captures intra- and inter-expert discrepancy. Our preliminary results show that surrogate ground truth derived from MR data is at least as good as that derived from histologically stained slices. Hence animal sacrifice is not necessary to evaluate automated ischemic stroke segmentation in a rat model of temporary middle cerebral artery occlusion.Bioinformatics, Neurosciences, Medical imaging and radiologyci42, esc2181, ad3197Neurological Surgery, Biomedical Informatics, Computer Science, Biomedical EngineeringA Novel Drill Set for the Enhancement and Assessment of Robotic Surgical Performancehttp://academiccommons.columbia.edu/catalog/ac:177630
Ro, Charles Y.; Toumpoulis, Ioannis K.; Ashton, Jr., Robert C.; Imielinska, Celina Z.; Jebara, Tony; Shin, Seung H.; Zipkin, J. D.; McGinty, James J.; Todd, George J.; DeRose, Jr., Joseph J.http://dx.doi.org/10.7916/D8J67FGCWed, 24 Sep 2014 00:00:00 +0000Background: There currently exist several training modules to improve performance during video-assisted surgery. The unique characteristics of robotic surgery make these platforms an inadequate environment for the development and assessment of robotic surgical performance. Methods: Expert surgeons (n=4) (greater than 50 clinical robotic procedures and greater than 2 years of clinical robotic experience) were compared to novice surgeons (n=17) (less than 5 clinical cases and limited laboratory experience) using the da Vinci Surgical System. Seven drills were designed to simulate clinical robotic surgical tasks. Performance score was calculated by the equation: Time to Completion + (minor errors) × 5 + (major errors) × 10. The Robotic Learning Curve (RLC) was expressed as a trend line of the performance scores corresponding to each repeated drill. Results: Performance scores for experts were better than novices in all 7 drills (p less than 0.05). The RLC for novices reflected an improvement in scores (p less than 0.05). In contrast, experts demonstrated a flat RLC for 6 drills and an improvement in one drill (p=0.027). Conclusion: This new drill set provides a framework for performance assessment during robotic surgery. The inclusion of particular drills and their role in training robotic surgeons of the future awaits larger validation studies.Medical imaging and radiology, Biomedical engineering, Surgeryci42, tj2008Biomedical Informatics, Computer ScienceConferencesObjective Quantification of Perfusion-Weighted Computed Tomography in the Setting of Acute Aneurysmal Subarachnoid Hemorrhagehttp://academiccommons.columbia.edu/catalog/ac:177612
Imielinska, Celina Z.; Liu, Xin; Sughrue, Michael E.; Kelly, Sean; Hagiwara, Eugene; Connolly Jr., E. Sander; D'Ambrosio, Anthony L.http://dx.doi.org/10.7916/D8M32T9CWed, 24 Sep 2014 00:00:00 +0000Perfusion-Weighted Computed Tomography (CTP) is a relatively recent innovation that estimates a value for cerebral blood flow (CBF) using a series of axial head CT images which tracks the time course of signal from an administered bolus of intravenous contrast. We introduce a novel computer-based method for objective quantification of CBF values calculated from CTP images. Our method corrects for the inherent variability of the CTP methodology seen in the subarachnoid hemorrhage (SAH) patient population to potentially aid in the diagnosis of cerebral vasospasm (CVS). This method analyzes and quantifies side-to-side asymmetry of CBF and represents relative differences in a construct termed a Relative Difference Map (RDM). Herein, we present our preliminary results that show that analysis of histograms of the RDM in left and right hemispheres, as well as different vascular territories of the brain, can be used for detection and diagnosis of cerebral vasospasm in patients with SAH. While this method has been designed specifically to analyze post-processed CTP images, it could be potentially applied to quantification and analysis of MR perfusion data, as well.Neurosciences, Bioinformatics, Medical imaging and radiologyci42, esc2181Neurological Surgery, Neuroradiology, Biomedical Informatics, Computer ScienceConferencesAugmented Reality for Teaching Endotracheal Intubation: MR Imaging to Create Anatomically Correct Modelshttp://academiccommons.columbia.edu/catalog/ac:177584
Kerner, Karen F.; Imielinska, Celina Z.; Rolland, Jannick; Tang, Haiyinghttp://dx.doi.org/10.7916/D8P849DMMon, 22 Sep 2014 00:00:00 +0000Clinical procedures have traditionally been taught at the bedside, in the morgue and in the animal lab. Augmented Reality (AR) technology (the merging of virtual reality and real objects or patients) provides a new method for teaching clinical and surgical procedures. Improved patient safety is a major advantage. We describe a system which employs AR technology to teach endotracheal intubation, using the Visible Human datasets, as well as MR images from live patient volunteers.Medicine, Educational technology, Computer sciencekfk9, ci42Biomedical Informatics, Computer Science, Medicine, RadiologyConferencesMerging Augmented Reality and Anatomically Correct 3D Models in the Development of a Training Tool for Endotracheal Intubationhttp://academiccommons.columbia.edu/catalog/ac:177145
Rolland, Jannick; Davis, Larry; Hamza-Lup, Felix G.; Norfleet, Jack; Imielinska, Celina Z.; Kerner, Karen F.http://dx.doi.org/10.7916/D8QV3K1SThu, 11 Sep 2014 00:00:00 +0000Augmented reality is often used for medical training systems in which the user visualizes 3D information superimposed on the real world. In this context, we introduce an augmented reality tool to train medical practitioners' hand-eye coordination in performing critical procedures such as endotracheal intubation.Biomedical engineering, Bioinformatics, Biomechanicsci42, kfk9Biomedical Informatics, Computer Science, MedicineConferencesStatistical Bilateral Asymmetry Measurement in Brain Imageshttp://academiccommons.columbia.edu/catalog/ac:177036
Liu, Xin; Ogden, Robert T.; Imielinska, Celina Z.; Laine, Andrew F.; Connolly Jr., E. Sander; D'Ambrosio, Anthonyhttp://dx.doi.org/10.7916/D88W3BS2Tue, 09 Sep 2014 00:00:00 +0000We present an improvement of an automated generic methodology for symmetry identification, asymmetry quantification, and segmentation of brain pathologies, utilizing the inherent bi-fold mirror symmetry in brain imagery. In a pipeline of operations that starts with detection of the symmetry axis and proceeds through hemisphere-wise cross registration, statistical correlation, and quantification of asymmetries, we segment a target brain pathology. The detection of left-to-right pathological differences in brain imagery is complicated by normal variations as well as geometric misalignment of anatomical structures between the two hemispheres. Introducing hemisphere-wise registration and spatial correlation makes our approach perform robustly in the presence of normal asymmetries and systematic artifacts such as bias field and acquisition noise.Medical imaging and radiology, Biomedical engineering, Bioinformaticsxl2104, to166, ci42, al418, esc2181, ad3197Radiation Oncology, Neurological Surgery, Biomedical Informatics, Computer Science, Biostatistics, Biomedical EngineeringConferencesAutomatic Correction of the 3D Orientation of the Brain Imageryhttp://academiccommons.columbia.edu/catalog/ac:177039
Liu, Xin; Imielinska, Celina Z.; Connolly Jr., E. Sander; D'Ambrosio, Anthonyhttp://dx.doi.org/10.7916/D81C1VBBTue, 09 Sep 2014 00:00:00 +0000Classification of human brain pathologies can be guided by the estimation of the departure of 3D internal structures from the normal bilateral symmetry. However, symmetry-based analysis cannot be precisely carried out when the 3D brain orientation is misaligned, a common occurrence in clinical practice. In this paper, a technique to automatically identify the symmetry plane and correct the 3D orientation of volumetric brain images in a cost-effective way is developed. The algorithm seeks the best sampling strategies to realign the 3D volumetric representation of the brain within the scanner coordinate system. The inertia matrix is computed on the sampled brain, and the principal axes are derived from the eigenvectors of the inertia matrix. The technique is demonstrated on MR and CT brain images, and the detected symmetry plane, which is orthogonal to the principal vectors, is provided. A spatial affine transform is applied to rotate the 3D brain images and align them within the coordinate system of the scanner. The corrected brain volume is re-sliced such that each planar image represents the brain at the same axial level.Medical imaging and radiology, Biomedical engineering, Bioinformaticsxl2104, ci42, esc2181, ad3197Neurological Surgery, Radiation Oncology, Biomedical Informatics, Computer ScienceConferencesMulti-scale Modeling of Trauma Injuryhttp://academiccommons.columbia.edu/catalog/ac:177003
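A compact sketch of the eigen-decomposition step the abstract describes, with voxel coordinates of a binary brain mask standing in for the sampled brain points (all names here are ours, not the paper's):

```python
import numpy as np

def principal_axes(mask):
    """Second-moment (inertia) matrix of a sampled brain mask; the columns
    of the returned eigenvector matrix are the principal axes."""
    pts = np.argwhere(mask).astype(float)
    centered = pts - pts.mean(axis=0)
    inertia = centered.T @ centered / len(pts)
    return np.linalg.eigh(inertia)        # symmetric matrix -> eigh

mask = np.zeros((32, 32, 32), dtype=bool)
mask[8:24, 6:26, 10:22] = True            # toy "brain" blob
eigvals, eigvecs = principal_axes(mask)
# The symmetry plane is taken orthogonal to a principal vector; an affine
# rotation built from eigvecs would then realign the volume with the
# scanner coordinate system before re-slicing.
```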
Imielinska, Celina Z.; Przekwas, Andrzej; Tan, X. G.http://dx.doi.org/10.7916/D8XK8D1QTue, 09 Sep 2014 00:00:00 +0000We develop multi-scale, high-fidelity biomechanical and physiologically based modeling tools for trauma (ballistic/impact and blast) injury to the brain, lung and spinal cord for resuscitation, treatment planning and design of personnel protection. Several approaches have been used to study blast and ballistic/impact injuries. Dummies containing pressure sensors and synthetic phantoms of human organs have been used to study bomb blasts and car crashes. Large animals such as pigs have also been equipped with pressure sensors and exposed to blast waves. But these methods are not anatomically and physiologically biofidelic to humans, do not provide full optimization of body protection design, and require animal sacrifice. Anatomy and medical image based high-fidelity computational modeling can be used to analyze injury mechanisms and to optimize the design of body protection. This paper presents a novel approach of coupled computational fluid dynamics (CFD) and computational structures dynamics (CSD) to simulate fluid (air, cerebrospinal fluid) and solid (cranium, brain tissue) interaction during ballistic/blast impact. We propose a trauma injury simulation pipeline concept starting from anatomy and medical image based high fidelity 3D geometric modeling, extraction of tissue morphology, generation of computational grids, multiscale biomechanical and physiological simulations, and data visualization.Medical imaging and radiology, Biomedical engineering, Bioinformaticsci42Radiation Oncology, Biomedical Informatics, Computer ScienceArticlesEnhanced Techniques for Asymmetry Quantification in Brain Imageryhttp://academiccommons.columbia.edu/catalog/ac:177050
Liu, Xin; Imielinska, Celina Z.; Rosiene, Joel; Connolly Jr., E. Sander; D'Ambrosio, Anthony L.http://dx.doi.org/10.7916/D80C4T8VTue, 09 Sep 2014 00:00:00 +0000We present an automated, generic methodology for symmetry identification and asymmetry quantification: a novel method for identifying and delineating brain pathology by analyzing the opposing sides of the brain, utilizing the brain's inherent left-right symmetry. After the symmetry axis has been detected, we apply non-parametric statistical tests operating on pairs of samples to identify initial seed points, defined as the pixels where the most statistically significant differences appear. Local region growing is then performed on the difference map, with the seeds aggregating until all 8-way connected high signals in the difference map are captured. We illustrate the capability of our method with examples ranging from tumors in patient MR data to animal stroke data. Validation results on rat stroke data show that this approach has promise to achieve high precision and full automation in segmenting lesions in reflectionally symmetric objects.Bioinformatics, Neurosciences, Medical imaging and radiologyci42, esc2181Neurological Surgery, Radiation Oncology, Biomedical Informatics, Computer ScienceConferencesQuantification of Diffusion-weighted Images (DWI) and Apparent Diffusion Coefficient Maps (ADC) in the Detection of Acute Strokehttp://academiccommons.columbia.edu/catalog/ac:177054
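A rough sketch of the seeded, 8-way connected growth step on a difference map, with simple thresholds standing in for the non-parametric statistical tests (the thresholds and names are our assumptions):

```python
import numpy as np
from scipy import ndimage

def grow_from_seeds(diff_map, seed_thresh, signal_thresh):
    """Keep every 8-connected component of 'high' difference pixels that
    contains at least one seed (a most-significant-difference pixel)."""
    high = diff_map >= signal_thresh
    eight_conn = np.ones((3, 3), dtype=bool)          # 8-way connectivity
    labels, _ = ndimage.label(high, structure=eight_conn)
    seed_labels = np.unique(labels[diff_map >= seed_thresh])
    return np.isin(labels, seed_labels[seed_labels > 0])

diff = np.random.rand(64, 64)                          # toy difference map
lesion_mask = grow_from_seeds(diff, seed_thresh=0.999, signal_thresh=0.9)
```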
Tulipano, Paola K.; Millar, William S.; Imielinska, Celina Z.; Liu, Xin; Rosiene, Joel; D'Ambrosio, Anthony L.http://dx.doi.org/10.7916/D8DJ5D4XTue, 09 Sep 2014 00:00:00 +0000Magnetic resonance (MR) imaging is an imaging modality that is used in the management and diagnosis of acute stroke. Common MR imaging techniques such as diffusion weighted imaging (DWI) and apparent diffusion coefficient maps (ADC) are used routinely in the diagnosis of acute infarcts. However, advances in radiology information systems and imaging protocols have led to an overload of image information that can be difficult to manage and time-consuming to review. Automated techniques to assist in the identification of acute ischemic stroke can prove beneficial to 1) the physician, by providing a mechanism for early detection, and 2) the patient, by providing effective stroke therapy at an early stage. We have processed DW images and ADC maps using a novel automated Relative Difference Map (RDM) method that was tailored to the identification and delineation of the stroke region. Results indicate that the technique can delineate regions of acute infarction on DW images and ADC maps. A formal evaluation of the RDM algorithm was performed by comparing accuracy measurements between 1) expert-generated ground truths and the RDM-delineated DWI infarcts and 2) RDM-delineated DWI infarcts and RDM-delineated ADC infarcts. The accuracy measurements indicate that the RDM-delineated DWI infarcts are comparable to the expert-generated ground truths. The true positive volume fraction (TPVF) value, between RDM-delineated DWI and ADC infarcts, is nonzero for all cases with an acute infarct, while the value for non-acute cases remains zero.Bioinformatics, Medical imaging and radiology, Biomedical engineeringpkt2, wsm8, ci42Radiation Oncology, Biomedical Informatics, Computer Science, RadiologyConferencesImplications for health and disease in the genetic signature of the Ashkenazi Jewish populationhttp://academiccommons.columbia.edu/catalog/ac:182749
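The abstract reports TPVF without restating it; the sketch below assumes the standard definition, the fraction of the ground-truth volume that the delineation captures:

```python
import numpy as np

def true_positive_volume_fraction(delineated, ground_truth):
    """TPVF = |delineated AND truth| / |truth|, on boolean volumes."""
    truth_voxels = ground_truth.sum()
    if truth_voxels == 0:
        return 0.0   # no true infarct volume, consistent with TPVF = 0
    overlap = np.logical_and(delineated, ground_truth).sum()
    return float(overlap / truth_voxels)
```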
Guha, Saurav; Rosenfeld, Jeffrey; Malhotra, Anil; Lee, Annette; Gregersen, Peter; Kane, John; Pe'er, Itsik; Darvasi, Ariel; Lencz, Toddhttp://dx.doi.org/10.7916/D85T3HTZMon, 08 Sep 2014 00:00:00 +0000Relatively small, reproductively isolated populations with reduced genetic diversity may have advantages for genomewide association mapping in disease genetics. The Ashkenazi Jewish population represents a unique population for study based on its recent (< 1,000 year) history of a limited number of founders, population bottlenecks and tradition of marriage within the community. We genotyped more than 1,300 Ashkenazi Jewish healthy volunteers from the Hebrew University Genetic Resource with the Illumina HumanOmni1-Quad platform. Comparison of the genotyping data with that of neighboring European and Asian populations enabled the Ashkenazi Jewish-specific component of the variance to be characterized with respect to disease-relevant alleles and pathways. Using clustering, principal components, and pairwise genetic distance as converging approaches, we identified an Ashkenazi Jewish-specific genetic signature that differentiated these subjects from both European and Middle Eastern samples. Most notably, gene ontology analysis of the Ashkenazi Jewish genetic signature revealed an enrichment of genes functioning in transepithelial chloride transport, such as CFTR, and in equilibrioception, potentially shedding light on cystic fibrosis, Usher syndrome and other diseases over-represented in the Ashkenazi Jewish population. Results also impact risk profiles for autoimmune and metabolic disorders in this population. Finally, residual intra-Ashkenazi population structure was minimal, primarily determined by class 1 MHC alleles, and not related to host country of origin. The Ashkenazi Jewish population is of potential utility in disease-mapping studies due to its relative homogeneity and distinct genomic signature. Results suggest that Ashkenazi-associated disease genes may be components of population-specific genomic differences in key functional pathways.Biostatistics, GeneticsComputer ScienceArticlesJISTIC: Identification of Significant Targets in Cancerhttp://academiccommons.columbia.edu/catalog/ac:183878
Sanchez-Garcia, Felix; Akavia, Uri David; Mozes, Eyal; Pe'er, Danahttp://dx.doi.org/10.7916/D85X279KMon, 08 Sep 2014 00:00:00 +0000Cancer arises through a multistep process in which a succession of genetic changes, each conferring a competitive advantage for growth and proliferation, leads to the progressive conversion of normal human cells into malignant cancer cells. Interrogation of cancer genomes holds the promise of understanding this process, thus revolutionizing cancer research and treatment. As datasets measuring copy number aberrations in tumors accumulate, a major challenge has become to distinguish between mutations that drive the cancer and passenger mutations that have no effect. We present JISTIC, a tool for analyzing datasets of genome-wide copy number variation to identify driver aberrations in cancer. JISTIC is an improvement over the widely used GISTIC algorithm. We compared the performance of JISTIC versus GISTIC on a dataset of glioblastoma copy number variation: JISTIC finds 173 significant regions, whereas GISTIC finds only 103. Importantly, the additional regions detected by JISTIC are enriched for oncogenes and genes involved in cell cycle and proliferation. JISTIC is an easy-to-install, platform-independent implementation of GISTIC that outperforms the original algorithm, detecting more relevant candidate genes and regions. The software and documentation are freely available and can be found at: http://www.c2b2.columbia.edu/danapeerlab/html/software.htmlOncology, Bioinformaticsfs2282, dp2315Computer Science, Biological SciencesArticlesAcoustic-Prosodic Entrainment in Human-Human and Human-Computer Dialoguehttp://academiccommons.columbia.edu/catalog/ac:177442
Levitan, Rivkahttp://dx.doi.org/10.7916/D8GT5KCHTue, 19 Aug 2014 00:00:00 +0000Entrainment (sometimes called adaptation or alignment) is the tendency of human speakers to adapt to or imitate characteristics of their interlocutors' behavior. This work focuses on entrainment on acoustic-prosodic features. Acoustic-prosodic entrainment has been extensively studied but is not well understood. In particular, it is difficult to compare the results of different studies, since entrainment is usually measured in different ways, reflecting disparate conceptualizations of the phenomenon. In the first part of this thesis, we look for evidence of entrainment on a variety of acoustic-prosodic features according to various conceptualizations, and show that human speakers of both Standard American English and Mandarin Chinese entrain to each other globally and locally, in synchrony, and that this entrainment can be constant or convergent. We explore the relationship between entrainment and gender and show that entrainment on some acoustic-prosodic features is related to social behavior and dialogue coordination. In addition, we show that humans entrain in a novel domain, backchannel-inviting cues, and propose and test a novel hypothesis: that entrainment will be stronger in the case of an outlier feature value. In the second part of the thesis, we describe a method for flexibly and dynamically entraining a TTS voice to multiple acoustic-prosodic features of a user's input utterances, and show in an exploratory study that users prefer an entraining avatar to one that does not entrain, are more likely to ask its advice, and choose more positive adjectives to describe its voice. This work introduces a coherent view of entrainment in both familiar and novel domains. Our results add to the body of knowledge of entrainment in human-human conversations and propose new directions for making use of that knowledge to enhance human-computer interactions.Computer sciencerl2515Computer ScienceDissertationsLoqui Human-Human Dialogue Corpus (Transcriptions and Annotations)http://academiccommons.columbia.edu/catalog/ac:176612
Passonneau, Rebecca; Sachar, Evaneethttp://dx.doi.org/10.7916/D82R3PW9Fri, 15 Aug 2014 00:00:00 +0000The Loqui project investigated dialogue strategies that would depend less on accurate speech recognition and more on context. The testbed application, the CheckItOut dialogue system, was modeled on a corpus of telephone transactions between patrons and librarians that we collected at New York City’s Andrew Heiskell Braille & Talking Book Library in 2006. This data release contains transcriptions of the original telephone transactions of eighty-two dialogues that were collected and used to inform the initial design of CheckItOut. It also contains annotations that capture dialogue acts, adjacency pairs (e.g., links between questions and their answers), discourse units, and specificity of referring expressions about the books under discussion.Computer sciencerp34Computer Science, Center for Computational Learning SystemsDatasetsAnalysis of trans eSNPs infers regulatory network architecturehttp://academiccommons.columbia.edu/catalog/ac:177106
Kreimer, Anathttp://dx.doi.org/10.7916/D8348HJRTue, 05 Aug 2014 00:00:00 +0000eSNPs are genetic variants associated with transcript expression levels. The characteristics of such variants highlight their importance and present a unique opportunity for studying gene regulation. eSNPs affect most genes, and their cell type specificity can shed light on different processes that are activated in each cell. They can identify functional variants by connecting SNPs that are implicated in disease to a molecular mechanism. Examining eSNPs that are associated with distal genes can provide insights regarding the inference of regulatory networks, but also presents challenges due to the high statistical burden of multiple testing. Such association studies allow simultaneous investigation of many gene expression phenotypes without assuming any prior knowledge, and identification of unknown regulators of gene expression while uncovering directionality. This thesis focuses on such distal eSNPs to map regulatory interactions between different loci and expose the architecture of the regulatory network defined by such interactions. We develop novel computational approaches and apply them to genetics-genomics data in human. We go beyond pairwise interactions to define network motifs, including regulatory modules and bi-fan structures, showing them to be prevalent in real data and exposing distinct attributes of such arrangements. We project eSNP associations onto a protein-protein interaction network to expose topological properties of eSNPs and their targets and highlight different modes of distal regulation. Overall, our work offers insights concerning the topological structure of human regulatory networks and the role genetics plays in shaping them.Bioinformaticsak2996Biomedical Informatics, Computer ScienceDissertationsProducing Trustworthy Hardware Using Untrusted Components, Personnel and Resourceshttp://academiccommons.columbia.edu/catalog/ac:175454
Waksman, Adamhttp://dx.doi.org/10.7916/D8N014PXMon, 07 Jul 2014 00:00:00 +0000Computer security is a full-system property, and attackers will always go after the weakest link in a system. In modern computer systems, the hardware supply chain is an obvious and vulnerable point of attack. The ever-increasing complexity of hardware systems, along with the globalization of the hardware supply chain, has made it unreasonable to trust hardware. Hardware-based attacks, known as backdoors, are easy to implement and can undermine the security of systems built on top of compromised hardware. Operating systems and other software can only be secure if they can trust the underlying hardware systems. The full supply chain for creating hardware includes multiple processes, which are often addressed in disparate threads of research, but which we consider as one unified process. On the front-end side, there is the soft design of hardware, along with validation and synthesis, to ultimately create a netlist, the document that defines the physical layout of hardware. On the back-end side, there is a physical fabrication process, where a chip is produced at a foundry from a supplied netlist, followed in some cases by post-fabrication testing. Producing a trustworthy chip means securing the process from the early design stages through to the post-fabrication tests. We propose, implement and analyze a series of methods for making the hardware supply chain resilient against a wide array of known and possible attacks. These methods allow for the design and fabrication of hardware using untrustworthy personnel, designs, tools and resources, while protecting the final product from large classes of attacks, some known previously and some discovered and taxonomized in this work. The overarching idea in this work is to take a full-process view of the hardware supply chain. We begin by securing the hardware design and synthesis processes using a defense-in-depth approach. We combine this work with foundry-side techniques to prevent malicious modifications and counterfeiting, and finally apply novel attestation techniques to ensure that hardware is trustworthy when it reaches users. For our design-side security approach, we use defense-in-depth because in practice, any security method can potentially be subverted, and defense-in-depth is the best way to handle that assumption. Our approach involves three independent steps. The first is a functional analysis tool (called FANCI), applied statically to designs during the coding and validation stages to remove any malicious circuits. The second step is to include physical security circuits that operate at runtime. These circuits, which we call trigger obfuscation circuits, scramble data at the microarchitectural level so that any hardware backdoors remaining in the design cannot be triggered at runtime. The third and final step is to include a runtime monitoring system that detects any backdoor payloads that might have been achieved despite the previous two steps. We design two different versions of this monitoring system. The first, TrustNet, is extremely lightweight and protects against an important class of attacks called emitter backdoors. The second, DataWatch, is a slightly more heavyweight system (though still efficient and low-overhead) that can catch a wider variety of attacks and can be adapted to protect against nearly any type of digital payload. 
We taxonomize the types of attacks that are possible against each of the three steps of our defense-in-depth system and show that each defense provides strong coverage with low (or negligible) overheads to performance, area and power consumption. For our foundry-side security approach, we develop the first foundry-side defense system that is aware of design-side security. We create a power-based side-channel, called a beacon. This beacon is essentially a benign backdoor. It can be turned on by a special key (not provided to the foundry), allowing for security attestation during post-fabrication testing. Because the beacon is designed into the design itself, it requires neither keys nor storage, and as such exists in the final chip purely by virtue of existing in the netlist. We further obfuscate the netlist itself, rendering the task of reverse engineering the beacon (for a foundry-side adversary) intractable. Both the inclusion of the beacon and the obfuscation process add little to area and power costs and have no impact on performance. Taken together, these methods provide a foundation on which hardware security can be developed and enhanced. They are low overhead and practical, making them suitable for inclusion in next generation hardware. Moving forward, the criticality of having trustworthy hardware can only increase. Ensuring that the hardware supply chain can be trusted in the face of sophisticated adversaries is vital. Both hardware design and hardware fabrication are increasingly international processes, and we believe continuing with this unified approach is the correct path for future research. In order for companies and governments to place trust in mission-critical hardware, it is necessary for hardware to be certified as secure and trustworthy. The methods we propose can be the first steps toward making this certification a reality.Computer scienceasw2118Computer ScienceDissertationsAccelerating Similarly Structured Datahttp://academiccommons.columbia.edu/catalog/ac:175516
Wu, Lisa K.http://dx.doi.org/10.7916/D8PR7T46Mon, 07 Jul 2014 00:00:00 +0000The failure of Dennard scaling [Bohr, 2007] and the rapid growth of data produced and consumed daily [NetApp, 2012] have made mitigating the dark silicon phenomenon [Esmaeilzadeh et al., 2011], and providing fast computation for processing large volumes and an expansive variety of data while consuming minimal energy, the most important challenges for modern computer architecture. This thesis introduces the concept that grouping data structures that are previously defined in software and processing them with an accelerator can significantly improve the application performance and energy efficiency. To measure the potential performance benefits of this hypothesis, this research starts out by examining the cache impacts on accelerating commonly used data structures and its applicability to popular benchmarks. We found that accelerating similarly structured data can provide substantial benefits; however, most popular benchmark suites do not contain shared acceleration targets and therefore cannot obtain significant performance or energy improvements via a handful of accelerators. To further examine this hypothesis in an environment where the common data structures are widely used, we choose to target the database application domain, using tables and columns as the similarly structured data, accelerating the processing of such data, and evaluating the performance and energy efficiency. Given that data partitioning is widely used for database applications to improve cache locality, we architect and design a streaming data partitioning accelerator to assess the feasibility of big data acceleration. The results show that we are able to achieve an order of magnitude improvement in partitioning performance and energy. To improve upon the present ad-hoc communications between accelerators and general-purpose processors [Vo et al., 2013], we also architect and evaluate a streaming framework that can be used for the data partitioner and other streaming accelerators alike. The streaming framework can provide at least 5 GB/s per stream per thread using software control, and is able to elegantly handle interrupts and context switches using a simple save/restore. As a final evaluation of this hypothesis, we architect a class of domain-specific database processors, or Database Processing Units (DPUs), to further improve the performance and energy efficiency of database applications. As a case study, we design and implement one DPU, called Q100, to execute industry standard analytic database queries. Despite Q100's sensitivity to communication bandwidth on-chip and off-chip, we find that the low-power configuration of Q100 is able to provide three orders of magnitude in energy efficiency over a state-of-the-art software Database Management System (DBMS), while the high-performance configuration is able to outperform the same DBMS by 70X. Based on these experiments, we conclude that grouping similarly structured data and processing it with accelerators vastly improves application performance and energy efficiency for a given application domain. 
This is primarily due to the fact that creating specialized encapsulated instruction and data accesses and datapaths allows us to mitigate unnecessary data movement, take advantage of data and pipeline parallelism, and consequently provide substantial energy savings while obtaining significant performance gains.Computer science, Computer engineeringlkw2115Computer ScienceDissertationsAnalytic Methods in Concrete Complexityhttp://academiccommons.columbia.edu/catalog/ac:176848
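As a software analogue of what the streaming partitioning accelerator does in hardware, here is a minimal hash-partitioning sketch; it illustrates only the operation itself (routing records by key so later operators see cache-friendly partitions), not the accelerator design:

```python
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    """Route each record to a partition by a hash of its key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

table = [{"id": i, "val": i * i} for i in range(10)]
parts = hash_partition(table, "id", 4)   # 4 cache-sized partitions
```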
Tan, Li-Yanghttp://dx.doi.org/10.7916/D8251GB1Mon, 07 Jul 2014 00:00:00 +0000This thesis studies computational complexity in concrete models of computation. We draw on a range of mathematical tools to understand the structure of Boolean functions, with analytic methods — Fourier analysis, probability theory, and approximation theory — playing a central role. These structural theorems are leveraged to obtain new computational results, both algorithmic upper bounds and complexity-theoretic lower bounds, in property testing, learning theory, and circuit complexity. We establish the best-known upper and lower bounds on the classical problem of testing whether an unknown Boolean function is monotone. We prove an Ω̃(n^{1/5}) lower bound on the query complexity of non-adaptive testers, an exponential improvement over the previous lower bound of Ω(log n) from 2002. We complement this with an Õ(n^{5/6})-query non-adaptive algorithm for the problem. We characterize the statistical query complexity of agnostically learning Boolean functions with respect to product distributions. We show that l_1-approximability by low-degree polynomials, known to be sufficient for efficient learning in this setting, is in fact necessary. As an application we establish an optimal lower bound showing that no statistical query algorithm can efficiently agnostically learn monotone k-juntas for any k = ω(1) and any constant error less than 1/2. We initiate a systematic study of the tradeoffs between accuracy and efficiency in Boolean circuit complexity, focusing on disjunctive normal form formulas, among the most basic types of circuits. A conceptual message that emerges is that the landscape of circuit complexity changes dramatically, both qualitatively and quantitatively, when the formula is only required to approximate a function rather than compute it exactly. Finally we consider the Fourier Entropy-Influence Conjecture, a long-standing open problem in the analysis of Boolean functions with significant applications in learning theory, the theory of pseudorandomness, and random graph theory. We prove a composition theorem for the conjecture, broadly expanding the class of functions for which the conjecture is known to be true.Computer scienceComputer ScienceDissertationsPopulation Genetics of Identity By Descenthttp://academiccommons.columbia.edu/catalog/ac:175990
Palamara, Pier Francescohttp://dx.doi.org/10.7916/D8V122XTMon, 07 Jul 2014 00:00:00 +0000Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species.Genetics, Computer science, Statisticspp2314Computer ScienceDissertationsNext Generation Emergency Call System with Enhanced Indoor Positioninghttp://academiccommons.columbia.edu/catalog/ac:175653
Song, Wonsanghttp://dx.doi.org/10.7916/D8QJ7FGCMon, 07 Jul 2014 00:00:00 +0000The emergency call systems in the United States and elsewhere are undergoing a transition from the PSTN-based legacy system to a new IP-based system. The new system is referred to as the Next Generation 9-1-1 (NG9-1-1) or NG112 system. We have built a prototype NG9-1-1 system which features media convergence and data integration that are unavailable in the current emergency calling system. The most important piece of information in the NG9-1-1 system is the caller's location. The caller's location is used for routing the call to the appropriate call center. The emergency responders use the caller's location to find the caller. Therefore, it is essential to determine the caller's location as precisely as possible to minimize delays in emergency response. Delays in response may result in loss of lives. When a person makes an emergency call outdoors using a mobile phone, the Global Positioning System (GPS) can provide the caller's location accurately. Indoor positioning, however, presents a challenge. GPS does not generally work indoors because satellite signals do not penetrate most buildings. Moreover, there is an important difference between determining location outdoors and indoors. Unlike outdoors, vertical accuracy is very important in indoor positioning because an error of a few meters will send emergency responders to a different floor in a building, which may cause a significant delay in reaching the caller. This thesis presents a way to augment our NG9-1-1 prototype system with a new indoor positioning system. The indoor positioning system focuses on improving the accuracy of vertical location. Our goal is to provide floor-level accuracy with minimum infrastructure support. Our approach is to use a user's smartphone to trace her vertical movement inside buildings. We utilize multiple sensors available in today's smartphones to enhance positioning accuracy. This thesis makes three contributions. First, we present a hybrid architecture for floor localization with emergency calls in mind. The architecture combines beacon-based infrastructure and sensor-based dead reckoning, striking a balance between accurately determining a user's location and minimizing the required infrastructure. Second, we present the elevator module for tracking a user's movement in an elevator. The elevator module addresses three core challenges that make it difficult to accurately derive displacement from acceleration. Third, we present the stairway module which determines the number of floors a user has traveled on foot. Unlike previous systems that track users' footsteps, our stairway module uses a novel landing counting technique. Additionally, this thesis presents our work on designing and implementing an NG9-1-1 prototype system. We first demonstrate how emergency calls from various call origination devices are identified, routed to the proper Public Safety Answering Point (PSAP) based on the caller's location, and terminated by the call taker software at the PSAP. We then show how text communications such as Instant Messaging and Short Message Service can be integrated into the NG9-1-1 architecture. We also present GeoPS-PD, a polygon simplification algorithm designed to improve the performance of location-based routing. 
GeoPS-PD reduces the size of a polygon, which represents the service boundary of a PSAP in the NG9-1-1 system.Computer sciencews2131Computer ScienceDissertationsMASC Word Sense Sentence Corpus, Crowdsourced subsethttp://academiccommons.columbia.edu/catalog/ac:175161
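To make the elevator module's core task concrete: a naive double integration of vertical acceleration into displacement. This deliberately ignores the sensor bias and drift problems the thesis identifies as the core challenges, so treat it as a statement of the problem rather than the thesis's solution (all names and parameters are ours):

```python
def vertical_displacement(accel_z, dt, g=9.81):
    """Naively double-integrate accelerometer samples (m/s^2, gravity
    included) taken every dt seconds into vertical displacement (m)."""
    velocity = 0.0
    displacement = 0.0
    prev_net = 0.0
    for a in accel_z:
        net = a - g                               # remove (assumed) gravity
        velocity += (prev_net + net) / 2.0 * dt   # trapezoidal integration
        displacement += velocity * dt             # integrate velocity
        prev_net = net
    return displacement

# 2 s of gentle upward acceleration sampled at 50 Hz.
print(vertical_displacement([9.81 + 0.3] * 100, dt=0.02))
```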
Passonneau, Rebecca; Carpenter, Bobhttp://dx.doi.org/10.7916/D80P0X5BFri, 27 Jun 2014 00:00:00 +0000The MASC Word Sense Sentence corpus, Crowdsourced subset, is distributed as a set of three *.tsv files (tab-separated format) that contain the sentences, annotation labels, and WordNet senses of the corpus. For 45 of the 116 words used from the original MASC Word Sense Sentence corpus (http://dx.doi.org/10.7916/D80V89XH), there are up to 1000 sentences per word drawn from the heterogeneous MASC corpus, with sense labels from WordNet. Each sentence exemplifies at least one MASC word, annotated for its WordNet sense. Each word/sentence pair has up to 25 crowdsourced sense labels collected on Amazon Mechanical Turk.Computer sciencerp34Computer Science, Center for Computational Learning SystemsDatasetsMASC Word Sense Sentence Corpus, tab-separated formathttp://academiccommons.columbia.edu/catalog/ac:175063
Passonneau, Rebecca; Ide, Nancy; Baker, Collin; Fellbaum, Christiane; Xie, Boyihttp://dx.doi.org/10.7916/D80V89XHTue, 24 Jun 2014 00:00:00 +0000Synopsis: The MASC Word Sense Sentence corpus is distributed as a set of three *.tsv files (tab-separated format) that contain the sentences, annotation labels, and senses that comprise the sentence corpus: (1) the annotation labels (masc_annotations.tsv), (2) the WordNet word senses (masc_senses.tsv), and (3) the word token-sentence pairs, or instances (masc_sentences.tsv). A total of 116 distinct lemmas were selected; for each lemma, approximately 1000 example sentences were taken from the MASC corpus; and for each word in its sentence context, a trained annotator assigned a WordNet sense (WordNet version 3.1) as the annotation label. The following README describes the data in detail.Computer sciencerp34, bx2109Computer Science, Center for Computational Learning SystemsDatasetsUnsupervised Anomaly-based Malware Detection using Hardware Featureshttp://academiccommons.columbia.edu/catalog/ac:174968
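A generic loader for the three files, assuming only tab separation and a header row; the actual column names are defined by the release's README, so nothing beyond the stated file names is hard-coded here:

```python
import csv

def load_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

annotations = load_tsv("masc_annotations.tsv")
senses = load_tsv("masc_senses.tsv")
sentences = load_tsv("masc_sentences.tsv")
```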
Tang, Adrian; Sethumadhavan, Simha; Stolfo, Salvatorehttp://dx.doi.org/10.7916/D8H1304ZTue, 17 Jun 2014 00:00:00 +0000Recent works have shown promise in using microarchitectural execution patterns to detect malware programs. These detectors belong to a class of detectors known as signature-based detectors as they catch malware by comparing a program's execution pattern (signature) to execution patterns of known malware programs. In this work, we propose a new class of detectors - anomaly-based hardware malware detectors - that do not require signatures for malware detection, and thus can catch a wider range of malware including potentially novel ones. We use unsupervised machine learning to build profiles of normal program execution based on data from performance counters, and use these profiles to detect significant deviations in program behavior that occur as a result of malware exploitation. We show that real-world exploitation of popular programs such as IE and Adobe PDF Reader on a Windows/x86 platform can be detected with nearly perfect certainty. We also examine the limits and challenges in implementing this approach in the face of a sophisticated adversary attempting to evade anomaly-based detection. The proposed detector is complementary to previously proposed signature-based detectors, and the two can be used together to improve security.Computer sciencess3418, sjs11Computer ScienceTechnical reportsA Red Team/Blue Team Assessment of Functional Analysis Methods for Malicious Circuit Identificationhttp://academiccommons.columbia.edu/catalog/ac:174971
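A toy illustration of the idea: profile 'normal' performance-counter epochs with an unsupervised model and flag deviations. The one-class SVM and synthetic data below are our stand-ins; the report's actual features and model are not reproduced here:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# One row per sampling epoch, one column per hardware performance counter.
normal_epochs = rng.normal(loc=1000, scale=50, size=(500, 8))
model = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale").fit(normal_epochs)

# An exploited process perturbs its microarchitectural behavior.
exploit_epoch = rng.normal(loc=1400, scale=50, size=(1, 8))
print(model.predict(exploit_epoch))   # -1 marks an anomaly
```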
Waksman, Adam; Rajendran, Jeyavijayan; Suozzo, Matthew Robert; Sethumadhavan, Simhahttp://dx.doi.org/10.7916/D87H1GQZTue, 17 Jun 2014 00:00:00 +0000Recent advances in hardware security have led to the development of FANCI (Functional Analysis for Nearly-Unused Circuit Identification), an analysis tool that identifies stealthy, malicious circuits within hardware designs that can perform malicious backdoor behavior. Evaluations of such tools against benchmarks and academic attacks are not always equivalent to the dynamic attack scenarios that can arise in the real world. For this reason, we apply a red team/blue team approach to stress-test FANCI's abilities to efficiently detect malicious backdoor circuits within hardware designs. In the Embedded Systems Challenge (ESC) 2013, teams from research groups from multiple continents created designs with malicious backdoors hidden in them as part of a red team effort to circumvent FANCI. Notably, these backdoors were not placed into a priori known designs. The red team was allowed to create arbitrary, unspecified designs. Two interesting results came out of this effort. The first was that FANCI was surprisingly resilient to this wide variety of attacks and was not circumvented by any of the stealthy backdoors created by the red teams. The second result is that frequent-action backdoors, which are backdoors that are not made stealthy, were often successful. These results emphasize the importance of combining FANCI with a reasonable degree of validation testing. The blue team efforts also exposed some aspects of the FANCI prototype that make analysis time-consuming in some cases, which motivates further development of the prototype in the future.Computer scienceasw2118, ms4249, ss3418Computer ScienceTechnical reportsEnergy Exchanges: Internal Power Oversight for Applicationshttp://academiccommons.columbia.edu/catalog/ac:174983
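FANCI's central metric is the control value: the fraction of input patterns for which toggling one input wire changes an output. Backdoor trigger wires score near zero. An exhaustive toy version follows (FANCI itself approximates control values by sampling over real netlists; the example function is ours):

```python
from itertools import product

def control_value(f, n_inputs, idx):
    """Fraction of assignments where toggling input `idx` flips f's output."""
    flips = 0
    for bits in product([0, 1], repeat=n_inputs):
        toggled = list(bits)
        toggled[idx] ^= 1
        if f(bits) != f(tuple(toggled)):
            flips += 1
    return flips / 2 ** n_inputs

# A stealthy trigger: output follows x0 except on one rare pattern of x1..x5.
trigger = lambda x: x[0] ^ (1 if x[1:] == (1, 1, 1, 1, 1) else 0)
print(control_value(trigger, 6, 0))   # ~1.0: an ordinary, well-used input
print(control_value(trigger, 6, 5))   # ~0.06: nearly unused, hence suspect
```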
Kambadur, Melanie Rae; Kim, Martha Allenhttp://dx.doi.org/10.7916/D8KS6PQ6Tue, 17 Jun 2014 00:00:00 +0000This paper introduces energy exchanges, a set of abstractions that allow applications to help hardware and operating systems manage power and energy consumption. Using annotations, energy exchanges dictate when, where, and how to trade performance or accuracy for power in ways that only an application's developer can decide. In particular, the abstractions offer audits and budgets which watch and cap the power or energy of some piece of the application. The interface also exposes energy and power usage reports which an application may use to change its behavior. Such information complements existing system-wide energy management by operating systems or hardware, which provide global fairness and protections, but are unaware of the internal dynamics of an application. Energy exchanges are implemented as a user-level C++ library. The library employs an accounting technique to attribute shares of system-wide energy consumption (provided by system-wide hardware energy meters available on newer hardware platforms) to individual application threads. With these per-thread meters and careful tracking of an application's activity, the library exposes energy and power usage for program regions of interest via the energy exchange abstractions with negligible runtime or power overhead. We use the library to demonstrate three applications of energy exchanges: (1) the prioritization of a mobile game's energy use over third-party advertisements, (2) dynamic adaptations of the framerate of a video tracking benchmark that maximize performance and accuracy within the confines of a given energy allotment, and (3) the triggering of computational sprints and corresponding cooldowns, based on time, system TDP, and power consumption.Computer sciencemrd2142, mak2191Computer ScienceTechnical reportsA Convergence Study of Multimaterial Mesh-based Surface Trackinghttp://academiccommons.columbia.edu/catalog/ac:174989
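The real system is a C++ library backed by per-thread attribution of hardware energy meters; the hypothetical Python analogue below only illustrates the budget abstraction described in the abstract (every name here is invented, and wall-clock time stands in for an energy meter):

```python
import time

class EnergyBudget:
    """Hypothetical budget: watch and cap the 'energy' of a program region."""

    def __init__(self, limit, read_meter):
        self.limit = limit          # allotment for this region of the app
        self.read_meter = read_meter
        self.start = read_meter()

    def spent(self):
        return self.read_meter() - self.start

    def exceeded(self):
        return self.spent() > self.limit

budget = EnergyBudget(limit=50.0, read_meter=time.monotonic)
frame_rate = 30
if budget.exceeded():
    frame_rate = 15   # trade accuracy for energy, as in the framerate case study
```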
Da, Fang; Batty, Christopher; Grinspun, Eitanhttp://dx.doi.org/10.7916/D8B8568VTue, 17 Jun 2014 00:00:00 +0000We report the results from experiments on the convergence of the multimaterial mesh-based surface tracking method introduced by the same authors. Under mesh refinement, approximately first order convergence or higher in L1 and L2 is shown for vertex positions, face normals and non-manifold junction curves in a number of scenarios involving the new operations proposed in the method.Computer sciencefd2263, eg2173Computer ScienceTechnical reportsMysterious Checks from Mauborgne to Fabyanhttp://academiccommons.columbia.edu/catalog/ac:174993
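Observed convergence order is conventionally estimated from errors at successive refinements; a small sketch consistent with the report's 'approximately first order' claim (the numbers below are illustrative, not the report's data):

```python
import math

def observed_orders(errors, spacings):
    """Order between refinements: log(e_i/e_{i+1}) / log(h_i/h_{i+1})."""
    return [
        math.log(errors[i] / errors[i + 1])
        / math.log(spacings[i] / spacings[i + 1])
        for i in range(len(errors) - 1)
    ]

# Halving the spacing roughly halves the L1 error => order close to 1.
print(observed_orders([0.080, 0.041, 0.021], [0.4, 0.2, 0.1]))
```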
Bellovin, Steven Michaelhttp://dx.doi.org/10.7916/D82R3PSZTue, 17 Jun 2014 00:00:00 +0000It has long been known that George Fabyan's Riverbank Laboratories provided the U.S. military with cryptanalytic and training services during World War I. The relationship has always been seen as voluntary. Newly discovered evidence suggests that Fabyan was in fact paid, at least in part, for his services.Computer sciencesmb2132Computer ScienceTechnical reportsPhosphor: Illuminating Dynamic Data Flow in the JVMhttp://academiccommons.columbia.edu/catalog/ac:174980
Bell, Jonathan Schaffer; Kaiser, Gail E.http://dx.doi.org/10.7916/D8QJ7FFXTue, 17 Jun 2014 00:00:00 +0000Dynamic taint analysis is a well-known information flow analysis problem with many possible applications. Taint tracking allows for analysis of application data flow by assigning labels to inputs, and then propagating those labels through data flow. Taint tracking systems traditionally compromise among performance, precision, accuracy, and portability. Performance can be critical, as these systems are typically intended to be deployed with software, and hence must have low overhead. To be deployed in security-conscious settings, taint tracking must also be accurate and precise. Dynamic taint tracking must be portable in order to be easily deployed and adopted for real world purposes, without requiring recompilation of the operating system or language interpreter, and without requiring access to application source code. We present Phosphor, a dynamic taint tracking system for the Java Virtual Machine (JVM) that simultaneously achieves our goals of performance, accuracy, precision, and portability. Moreover, to our knowledge, it is the first portable general purpose taint tracking system for the JVM. We evaluated Phosphor's performance on two commonly used JVM languages (Java and Scala), on two versions of two commonly used JVMs (Oracle's HotSpot and OpenJDK's IcedTea) and on Android's Dalvik Virtual Machine, finding its performance to be impressive: as low as 3% (53% on average), using the DaCapo macro benchmark suite. This paper describes the approach that Phosphor uses to achieve portable taint tracking in the JVM.Computer sciencejsb2125, gek1Computer ScienceTechnical reportsEnhancing Security by Diversifying Instruction Setshttp://academiccommons.columbia.edu/catalog/ac:174977
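A toy model of the propagation rule Phosphor implements via JVM bytecode instrumentation: values carry label sets, and labels flow with data. This Python sketch mirrors only the semantics, not the implementation:

```python
class Tainted:
    """A value bundled with taint labels that propagate through data flow."""

    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)

    def __add__(self, other):
        # The result's labels are the union of the operands' labels.
        labels = self.labels | getattr(other, "labels", frozenset())
        return Tainted(self.value + getattr(other, "value", other), labels)

user_input = Tainted(41, {"untrusted"})
derived = user_input + 1                 # data flow carries the label along
assert "untrusted" in derived.labels
```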
Sinha, Kanad; Kemerlis, Vasileios; Pappas, Vasileios; Sethumadhavan, Simha; Keromytis, Angelos D.http://dx.doi.org/10.7916/D8V69GQGTue, 17 Jun 2014 00:00:00 +0000Despite the variety of choices regarding hardware and software, to date a large number of computer systems remain identical. Characteristic examples of this trend are Windows on x86 and Android on ARM. This homogeneity, sometimes referred to as "computing oligoculture", provides a fertile ground for malware in the highly networked world of today. One way to counter this problem is to diversify systems so that attackers cannot quickly and easily compromise a large number of machines. For instance, if each system has a different ISA, the attacker has to invest more time in developing exploits that run on every system manifestation. It is not that each individual attack gets harder, but the spread of malware slows down. Further, if the diversified ISA is kept secret from the attacker, the bar for exploitation is raised even higher. In this paper, we show that system diversification can be realized by enabling diversity at the lowest hardware/software interface, the ISA, with almost zero performance overhead. We also describe how practical development and deployment problems of diversified systems can be handled easily in the context of popular software distribution models, such as the mobile app store model. We demonstrate our proposal with an OpenSPARC FPGA prototype.Computer scienceks2935, vk2209, vp2214, ss3418, ak2052Computer ScienceTechnical reportsVernam, Mauborgne, and Friedman: The One-Time Pad and the Index of Coincidencehttp://academiccommons.columbia.edu/catalog/ac:175002
Bellovin, Steven Michaelhttp://dx.doi.org/10.7916/D8Z0369CTue, 17 Jun 2014 00:00:00 +0000The conventional narrative for the invention of the AT&T one-time pad was related by David Kahn. Based on the evidence available in the AT&T patent files and from interviews and correspondence, he concluded that Gilbert Vernam came up with the need for randomness, while Joseph Mauborgne realized the need for a non-repeating key. Examination of other documents suggests a different narrative. It is most likely that Vernam came up with the need for non-repetition; Mauborgne, though, apparently contributed materially to the invention of the two-tape variant. Furthermore, there is reason to suspect that he suggested the need for randomness to Vernam. However, neither Mauborgne, Herbert Yardley, nor anyone at AT&T really understood the security advantages of the true one-time tape. Col. Parker Hitt may have; William Friedman definitely did. Finally, we show that Friedman's attacks on the two-tape variant likely led to his invention of the index of coincidence, arguably the single most important publication in the history of cryptanalysis.Computer sciencesmb2132Computer ScienceTechnical reportsModel Aggregation for Distributed Content Anomaly Detectionhttp://academiccommons.columbia.edu/catalog/ac:175005
Whalen, Sean; Boggs, Nathaniel Gordon; Stolfo, Salvatorehttp://dx.doi.org/10.7916/D8TB151TTue, 17 Jun 2014 00:00:00 +0000Cloud computing offers a scalable, low-cost, and resilient platform for critical applications. Securing these applications against attacks targeting unknown vulnerabilities is an unsolved challenge. Network anomaly detection addresses such zero-day attacks by modeling attributes of attack-free application traffic and raising alerts when new traffic deviates from this model. Content anomaly detection (CAD) is a variant of this approach that models the payloads of such traffic instead of higher level attributes. Zero-day attacks then appear as outliers to properly trained CAD sensors. In the past, CAD was unsuited to cloud environments due to the relative overhead of content inspection and the dynamic routing of content paths to geographically diverse sites. We challenge this notion and introduce new methods for efficiently aggregating content models to enable scalable CAD in dynamically-pathed environments such as the cloud. These methods eliminate the need to exchange raw content, drastically reduce network and CPU overhead, and offer varying levels of content privacy. We perform a comparative analysis of our methods using Random Forest, Logistic Regression, and Bloom Filter-based classifiers for operation in the cloud or other distributed settings such as wireless sensor networks. We find that content model aggregation offers statistically significant improvements over non-aggregate models with minimal overhead, and that distributed and non-distributed CAD have statistically indistinguishable performance. Thus, these methods enable the practical deployment of accurate CAD sensors in a distributed attack detection infrastructure.Computer sciencengb2113, sjs11Computer ScienceTechnical reportsTeaching Microarchitecture through Metaphorshttp://academiccommons.columbia.edu/catalog/ac:174974
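For the Bloom filter classifier, aggregation is especially direct: merging trained filters is a bitwise OR, so sensors exchange compact bit arrays instead of raw content. A minimal sketch (sizes and hashing below are toy choices, not the paper's parameters):

```python
import hashlib

class BloomModel:
    """Minimal Bloom-filter content model; aggregation is a bitwise OR."""

    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, ngram):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{ngram}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, ngram):
        for p in self._positions(ngram):
            self.bits |= 1 << p

    def contains(self, ngram):
        return all((self.bits >> p) & 1 for p in self._positions(ngram))

    def merge(self, other):
        self.bits |= other.bits    # aggregate models; no raw content moves

site_a, site_b = BloomModel(), BloomModel()
site_a.add("GET /index")
site_b.add("POST /login")
site_a.merge(site_b)
assert site_a.contains("POST /login")
```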
Eum, Julianna; Sethumadhavan, Simhahttp://dx.doi.org/10.7916/D83R0R0NTue, 17 Jun 2014 00:00:00 +0000Students traditionally learn microarchitecture by studying textual descriptions with diagrams but few analogies. Several popular textbooks on this topic introduce concepts such as pipelining and caching in the context of simple paper-only architectures. While this instructional style allows important concepts to be covered within a given class period, students have difficulty bridging the gap between what is covered in classes and real-world implementations. Discussing concrete implementations and complications would, however, take too much time. In this paper, we propose a technique of representing microarchitecture building blocks with animated metaphors to accelerate the process of learning about complex microarchitectures. We represent hardware implementations as road networks that include specific patterns of traffic flow found in microarchitectural behavior. Our experiences indicate an 83% improvement in understanding memory system microarchitecture. We believe the mental models developed by these students will serve them in remembering microarchitectural behavior and extend to learning new microarchitectures more easily.Computer sciencess3418Computer ScienceTechnical reportsThe Economics of Cyberwarhttp://academiccommons.columbia.edu/catalog/ac:174986
Bellovin, Steven Michaelhttp://dx.doi.org/10.7916/D8G15Z06Tue, 17 Jun 2014 00:00:00 +0000Cyberwar is very much in the news these days. It is tempting to try to understand the economics of such an activity, if only qualitatively. What effort is required? What can such attacks accomplish? What does this say, if anything, about the likelihood of cyberwar?Computer sciencesmb2132Computer ScienceTechnical reportsSchur Complement Trick for Positive Semi-definite Energieshttp://academiccommons.columbia.edu/catalog/ac:175008
Jacobson, Alec S.http://dx.doi.org/10.7916/D8JS9NKTTue, 17 Jun 2014 00:00:00 +0000The Schur complement trick appears sporadically in numerical optimization methods [Schur 1917; Cottle 1974]. The trick is especially useful for solving Lagrangian saddle point problems when minimizing quadratic energies subject to linear equality constraints [Gill et al. 1987]. Typically, to apply the trick, the energy's Hessian is assumed positive definite. I generalize this technique for positive semi-definite Hessians.Computer scienceasj2141Computer ScienceTechnical reportsExploring Societal Computing based on the Example of Privacyhttp://academiccommons.columbia.edu/catalog/ac:175011
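For reference, the saddle-point system in question and the classical elimination, which assumes the Hessian Q is positive definite (the note's contribution is relaxing this assumption to positive semi-definite Q, where Q^{-1} does not exist):

```latex
% Minimize (1/2) x^T Q x - x^T b subject to A x = c; the KKT system is
\begin{bmatrix} Q & A^{\mathsf T} \\ A & 0 \end{bmatrix}
\begin{bmatrix} x \\ \lambda \end{bmatrix}
=
\begin{bmatrix} b \\ c \end{bmatrix}.
% With Q positive definite, eliminating x yields the Schur complement system
%   (A Q^{-1} A^{\mathsf T}) \lambda = A Q^{-1} b - c,
% after which x = Q^{-1} (b - A^{\mathsf T} \lambda).
```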
Sheth, Swapneelhttp://dx.doi.org/10.7916/D8F18WVDTue, 17 Jun 2014 00:00:00 +0000Data privacy when using online systems like Facebook and Amazon has become an increasingly popular topic in the last few years. This thesis consists of the following four projects, which aim to address the issues of privacy and software engineering. First, only a little is known about how users and developers perceive privacy and which concrete measures would mitigate their privacy concerns. To investigate privacy requirements, we conducted an online survey with closed and open questions and collected 408 valid responses. Our results show that users often reduce privacy to security, with data sharing and data breaches being their biggest concerns. Users are more concerned about the content of their documents and their personal data such as location than about their interaction data. Unlike users, developers clearly prefer technical measures like data anonymization and think that privacy laws and policies are less effective. We also observed interesting differences between people from different geographies. For example, people from Europe are more concerned about data breaches than people from North America. People from Asia/Pacific and Europe believe that content and metadata are more critical for privacy than people from North America. Our results contribute to developing a user-driven privacy framework that is based on empirical evidence in addition to the legal, technical, and commercial perspectives. Second, a challenge related to the above is to make privacy more understandable in complex systems that may have a variety of user interface options, which may change often. As social network platforms have evolved, the ability for users to control how and with whom information is being shared introduces challenges concerning the configuration and comprehension of privacy settings. To address these concerns, our crowdsourced approach simplifies the understanding of privacy settings by using data collected from 512 users over a 17-month period to generate visualizations that allow users to compare their personal settings to an arbitrary subset of individuals of their choosing. To validate our approach we conducted an online survey with closed and open questions and collected 59 valid responses, after which we conducted follow-up interviews with 10 respondents. Our results showed that 70% of respondents found visualizations using crowdsourced data useful for understanding privacy settings, and 80% preferred a crowdsourced tool for configuring their privacy settings over current privacy controls. Third, as software evolves over time, this evolution might introduce bugs that breach users' privacy. Further, there might be system-wide policy changes that could change users' settings to be more or less private than before. We present a novel technique that can be used by end-users for detecting changes in privacy, i.e., regression testing for privacy. Using a social approach for detecting privacy bugs, we present two prototype tools. Our evaluation shows the feasibility and utility of our approach for detecting privacy bugs. We highlight two interesting case studies on the bugs that were discovered using our tools. To the best of our knowledge, this is the first technique that leverages regression testing for detecting privacy bugs from an end-user perspective. 
Fourth, approaches to addressing these privacy concerns typically require substantial extra computational resources, which might be beneficial where privacy is concerned, but may have significant negative impact with respect to Green Computing and sustainability, another major societal concern. Spending more computation time results in spending more energy and other resources that make the software system less sustainable. Ideally, what we would like are techniques for designing software systems that address these privacy concerns but which are also sustainable - systems where privacy could be achieved "for free", i.e., without having to spend extra computational effort. We describe how privacy can indeed be achieved for free, as an accidental and beneficial side effect of doing some existing computation, in web applications and online systems that have access to user data. We show the feasibility, sustainability, and utility of our approach and what types of privacy threats it can mitigate. Finally, we generalize the problem of privacy and its tradeoffs. As Social Computing has increasingly captivated the general public, it has become a popular research area for computer scientists. Social Computing research focuses on online social behavior and using artifacts derived from it for providing recommendations and other useful community knowledge. Unfortunately, some of that behavior and knowledge incur societal costs, particularly with regard to Privacy, which is viewed quite differently by different populations as well as regulated differently in different locales. But clever technical solutions to those challenges may impose additional societal costs, e.g., by consuming substantial resources at odds with Green Computing, another major area of societal concern. We propose a new crosscutting research area, Societal Computing, that focuses on the technical tradeoffs among computational models and application domains that raise significant societal issues. We highlight some of the relevant research topics and open problems that we foresee in Societal Computing. We feel that these topics, and Societal Computing in general, need to gain prominence as they will provide useful avenues of research leading to increasing benefits for society as a whole.Computer sciencesks2142Computer ScienceDissertationsProsodic Entrainment in Mandarin and English: A Cross-Linguistic Comparisonhttp://academiccommons.columbia.edu/catalog/ac:174889
Xia, Zhihua; Levitan, Rivka; Hirschberg, Julia Bellhttp://dx.doi.org/10.7916/D8F47M84Fri, 13 Jun 2014 00:00:00 +0000Entrainment is the propensity of speakers to begin behaving like one another in conversation. We identify evidence of entrainment in a number of acoustic and prosodic dimensions in conversational speech of Standard American English speakers and Mandarin Chinese speakers. We compare entrainment in the Columbia Games Corpus and the Tongji Games Corpus and find similar patterns of global and local entrainment in both. Differences appear primarily in global convergence.Computer science, Linguisticsrl2515, jbh2019Computer ScienceTechnical reportsSelected machine learning reductionshttp://academiccommons.columbia.edu/catalog/ac:172841
Choromanska, Anna Ewahttp://dx.doi.org/10.7916/D8QF8QZ9Fri, 11 Apr 2014 00:00:00 +0000Machine learning is a field of science aiming to extract knowledge from data. Optimization lies at the core of machine learning, as many learning problems are formulated as optimization problems, where the goal is to minimize/maximize an objective function. More complex machine learning problems are then often solved by reducing them to simpler sub-problems solvable by known optimization techniques. This dissertation addresses two elements of the machine learning system 'pipeline': designing efficient basic optimization tools tailored to solve specific learning problems, or in other words to optimize a specific objective function, and creating more elaborate learning tools whose sub-blocks are essentially optimization solvers equipped with such basic optimization tools. In the first part of this thesis we focus on a very specific learning problem where the objective function, either convex or non-convex, involves the minimization of the partition function, the normalizer of a distribution, as is the case in conditional random fields (CRFs) or log-linear models. Our work proposes a tight quadratic bound on the partition function whose parameters are easily recovered by a simple algorithm that we propose. The bound gives rise to a family of new optimization learning algorithms, based on bound majorization (we developed batch, both full-rank and low-rank, and semi-stochastic variants), with linear convergence rate that successfully compete with state-of-the-art techniques (among them gradient descent methods, Newton and quasi-Newton methods like L-BFGS, etc.). The only constraint we introduce is on the number of classes, which is assumed to be finite and enumerable. The bound majorization method we develop is simultaneously the first reduction scheme discussed in this thesis, where throughout this thesis by 'reduction' we mean a learning approach or algorithmic technique converting a complex machine learning problem into a set of simpler problems (which can be as small as a single problem). Secondly, we focus on developing two more sophisticated machine learning tools for solving harder learning problems. The tools that we develop are built from basic optimization sub-blocks tailored to solve simpler optimization sub-problems. We first focus on the multi-class classification problem where the number of classes is very large. We reduce this problem to a set of simpler sub-problems that we solve using basic optimization methods performing additive updates on the parameter vector. Secondly, we address the problem of learning a data representation when the data is unlabeled, for use in any classification task. We reduce this problem to a set of simpler sub-problems that we solve using basic optimization methods, however this time the parameter vector is updated multiplicatively. In both problems we assume that the data come in a stream that can even be infinite. We will now provide a more specific description of each of these problems and describe our approach for solving them. In the multi-class classification problem it is desirable to achieve train and test running times which are logarithmic in the label complexity. The existing approaches to this problem are either intractable or do not adapt well to the data. 
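The bound-majorization idea described in the abstract above can be illustrated with a minimal majorize-minimize loop for a log-linear model. The sketch below substitutes Böhning's classical fixed quadratic bound on the multinomial-logistic Hessian for the tighter, iteratively re-fitted bound developed in the thesis; only the structure (exactly minimize a quadratic upper bound at each step, guaranteeing monotone progress) is meant to carry over.

```python
import numpy as np

def bohning_mm_logistic(X, y, num_classes, iters=50):
    """Majorize-minimize for multinomial logistic regression using
    Bohning's fixed quadratic upper bound on the Hessian. Each step
    exactly minimizes the surrogate, so the negative log-likelihood
    decreases monotonically. (A stand-in for the thesis's tighter
    partition-function bound, which changes per iteration.)"""
    n, d = X.shape
    K = num_classes
    Theta = np.zeros((K, d))
    # Bohning curvature (1/2)(I - 11^T/K) upper-bounds diag(p) - pp^T
    # for every probability vector p, hence for every Theta.
    B = 0.5 * (np.eye(K) - np.ones((K, K)) / K)
    S = np.kron(B, X.T @ X)          # (Kd x Kd) bound, class-major blocks
    S += 1e-6 * np.eye(K * d)        # ridge: B is singular (rows sum to 0)
    Y = np.eye(K)[y]                 # one-hot labels, shape (n, K)
    for _ in range(iters):
        logits = X @ Theta.T
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
        g = ((P - Y).T @ X).ravel()                # NLL gradient, class-major
        Theta = Theta - np.linalg.solve(S, g).reshape(K, d)
    return Theta
```

With the thesis's tighter bound, the same loop applies with a curvature matrix re-fitted at each iterate instead of the fixed S above.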
We propose a reduction of this problem to a set of binary regression problems organized in a tree structure and introduce a new splitting criterion (objective function) allowing gradient descent style optimization (bound optimization methods can also be used). The decision tree algorithm that we obtain differs from traditional decision trees in the objective optimized, and in how that optimization is done. The different objective has useful properties, such as guaranteeing balanced and small-error splits, while the optimization uses an online learning algorithm that is queried and trained simultaneously as we pass over the data. Furthermore, we prove an upper bound on the number of splits required to reduce the entropy of the tree leaves below a small threshold. We empirically show that the trees we obtain have logarithmic depth, which implies logarithmic training and testing running times, and significantly smaller error than random trees. Finally, we consider the problem of unsupervised (clustering) learning of data representation, where the quality of the obtained clustering is measured using a very simple, intuitive and widely cited clustering objective, the k-means clustering objective. We introduce a family of online clustering algorithms by extending algorithms for online supervised learning, with access to expert predictors (which are basic sub-blocks of our learning system), to the unsupervised learning setting. The parameter vector corresponds to the probability distribution over the experts. Different update rules for the parameter vector depend on an approximation to the current value of the k-means clustering objective obtained by each expert, and model different levels of non-stationarity in the data. We show that when the experts are batch clustering algorithms with approximation guarantees with respect to the k-means clustering objective, applied to a sliding window of the data stream, our algorithms obtain approximation guarantees with respect to the k-means clustering objective. Thus we simultaneously address an open problem posed by Dasgupta on approximating the k-means clustering objective on data streams. We experimentally show that our algorithms' empirical performance tracks that of the best clustering algorithm in their expert sets and that our algorithms outperform widely used online algorithms.Computer scienceaec2163Computer Science, Electrical EngineeringDissertationsOvercoming the Intuition Wall: Measurement and Analysis in Computer Architecturehttp://academiccommons.columbia.edu/catalog/ac:171206
Demme, John Davidhttp://dx.doi.org/10.7916/D8X0652NFri, 28 Feb 2014 00:00:00 +0000These are exciting times for computer architecture research. Today there is significant demand to improve the performance and energy-efficiency of emerging, transformative applications which are being hammered out by the hundreds for new computing platforms and usage models. This booming growth of applications and the variety of programming languages used to create them is challenging our ability as architects to rapidly and rigorously characterize these applications. Concurrently, hardware has become more complex with the emergence of accelerators, multicore systems, and heterogeneity caused by further divergence between processor market segments. No one architect can now understand all the complexities of many systems and reason about the full impact of changes or new applications. To that end, this dissertation presents four case studies in quantitative methods. Each case study attacks a different application and proposes a new measurement or analytical technique. In each case study we find at least one surprising or unintuitive result which would likely not have been found without the application of our method.Computer sciencejddComputer ScienceDissertationsTraffic Analysis Attacks and Defenses in Low Latency Anonymous Communicationhttp://academiccommons.columbia.edu/catalog/ac:171233
Chakravarty, Sambuddhohttp://dx.doi.org/10.7916/D8MK69Z1Fri, 28 Feb 2014 00:00:00 +0000The recent public disclosure of mass surveillance of electronic communication, involving powerful government authorities, has drawn the public's attention to issues regarding Internet privacy. For almost a decade now, there have been several research efforts towards designing and deploying open source, trustworthy and reliable systems that ensure users' anonymity and privacy. These systems operate by hiding the true network identity of communicating parties against eavesdropping adversaries. Tor, acronym for The Onion Router, is an example of such a system. Such systems relay the traffic of their users through an overlay of nodes that are called Onion Routers and are operated by volunteers distributed across the globe. Such systems have served well as anti-censorship and anti-surveillance tools. However, recent publications have disclosed that powerful government organizations are seeking means to de-anonymize such systems and have deployed distributed monitoring infrastructure to aid their efforts. Attacks against anonymous communication systems, like Tor, often involve traffic analysis. In such attacks, an adversary, capable of observing network traffic statistics in several different networks, correlates the traffic patterns in these networks, and associates otherwise seemingly unrelated network connections. The process can lead an adversary to the source of an anonymous connection. However, due to their design, consisting of globally distributed relays, the users of anonymity networks like Tor can route their traffic virtually via any network, hiding their tracks and true identities from their communication peers and eavesdropping adversaries. De-anonymization of a random anonymous connection is hard, as the adversary is required to correlate traffic patterns in one network link to those in virtually all other networks. Past research mostly involved reducing the complexity of this process by first reducing the set of relays or network routers to monitor, and then identifying the actual source of anonymous traffic among network connections that are routed via this reduced set of relays or network routers. A study of various research efforts in this field reveals that there have been many more efforts to reduce the set of relays or routers to be searched than to explore methods for actually identifying an anonymous user amidst the network connections using these routers and relays. Few have tried to comprehensively study a complete attack that involves reducing the set of relays and routers to monitor and identifying the source of an anonymous connection. Although it is believed that systems like Tor are trivially vulnerable to traffic analysis, there are various technical challenges and issues that can become obstacles to accurately identifying the source of an anonymous connection. It is hard to assess the vulnerability of anonymous communication systems without adequately exploring the issues involved in identifying the source of anonymous traffic. We take steps to fill this gap by exploring two novel active traffic analysis attacks that rely solely on measurements of network statistics. In these attacks, the adversary tries to identify the source of an anonymous connection arriving at a server from an exit node. This generally involves correlating traffic entering and leaving the Tor network, linking otherwise unrelated connections. 
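As a toy illustration of the correlation step just described (not the dissertation's actual attack pipeline), one can compare per-interval byte counts observed at the server against each candidate entry-side flow; the function name, inputs, and threshold below are illustrative.

```python
import numpy as np

def flow_match(server_counts, candidate_counts, threshold=0.7):
    """Pearson-correlate per-interval byte counts of the server-side
    flow with one candidate entry-side flow. A coefficient near 1
    suggests both observations belong to the same underlying
    connection. The 0.7 threshold is illustrative only."""
    a = np.asarray(server_counts, dtype=float)
    b = np.asarray(candidate_counts, dtype=float)
    n = min(len(a), len(b))                  # align observation windows
    r = float(np.corrcoef(a[:n], b[:n])[0, 1])
    return r, r >= threshold
```

Applied across all candidates, the highest-scoring flow is the attacker's guess at the victim's entry-side connection; the deliberately injected perturbation pattern described below exists precisely to make that correlation stand out against unrelated traffic.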
To increase the accuracy of identifying the victim connection among several connections, the adversary injects a traffic perturbation pattern into a connection arriving at the server from a Tor node that the adversary wants to de-anonymize. One way to achieve this is by colluding with the server and injecting a traffic perturbation pattern using common traffic shaping tools. Our first attack involves a novel remote bandwidth estimation technique to confirm the identity of Tor relays and network routers along the path connecting a Tor client and a server by observing network bandwidth fluctuations deliberately injected by the server. The second attack involves correlating network statistics, for connections entering and leaving the Tor network, available from existing network infrastructure, such as Cisco's NetFlow, for identifying the source of an anonymous connection. Additionally, we explored a novel technique to defend against the latter attack. Most proposed defenses against traffic analysis attacks, involving transmission of dummy traffic, have not been implemented due to fears of potential performance degradation. Our novel technique involves transmission of dummy traffic consisting of packets with IP headers having small Time-to-Live (TTL) values. Such packets are discarded by the routers before they reach their destination. They distort NetFlow statistics without degrading the client's performance. Finally, we present a strategy that employs transmission of unique plain-text decoy traffic, which appears sensitive, such as fake user credentials, through Tor nodes to decoy servers under our control. Periodic tallying of client and server logs to determine unsolicited connection attempts at the server is used to identify the eavesdropping nodes. Such malicious Tor node operators, eavesdropping on users' traffic, could be potential traffic analysis attackers.Computer scienceComputer ScienceDissertationsVMVM: Unit Test Virtualization for Java (System Implementation)http://academiccommons.columbia.edu/catalog/ac:171118
Bell, Jonathan Schaffer; Kaiser, Gailhttp://dx.doi.org/10.7916/D89C6VF2Thu, 27 Feb 2014 00:00:00 +0000Testing large software packages can become very time intensive. To address this problem, researchers have investigated techniques such as Test Suite Minimization. Test Suite Minimization reduces the number of tests in a suite by removing tests that appear redundant, at the risk of a reduction in fault-finding ability since it can be difficult to identify which tests are truly redundant. We take a completely different approach to solving the same problem of long running test suites by instead reducing the time needed to execute each test, an approach that we call Unit Test Virtualization. With Unit Test Virtualization, we reduce the overhead of isolating each unit test with a lightweight virtualization container. We describe the empirical analysis that grounds our approach and provide an implementation of Unit Test Virtualization targeting Java applications. We evaluated our implementation, VMVM, using 20 real-world Java applications and found that it reduces test suite execution time by up to 97% (on average, 62%) when compared to traditional unit test execution. We also compared VMVM to a well known Test Suite Minimization technique, finding the reduction provided by VMVM to be four times greater, while still executing every test with no loss of fault-finding ability. This archive contains the implementation for VMVM.Computer sciencejsb2125, gek1Computer ScienceComputer softwareEnabling the Virtual Phones to remotely sense the Real Phones in real-time: A Sensor Emulation initiative for virtualized Android-x86http://academiccommons.columbia.edu/catalog/ac:171022
Santhanam, Raghavanhttp://dx.doi.org/10.7916/D8Z899FFWed, 26 Feb 2014 00:00:00 +0000Smartphones today offer features that were once only a figment of one’s imagination, and for ever-demanding users the list of features a smartphone supports keeps growing, aiding both personal and professional use. Given this trajectory, the ability to virtualize smartphones with all their real-world features on a virtual platform is a boon for those who want to rigorously experiment with and customize virtualized smartphone hardware without spending an extra penny. Once smartphones can be virtualized independently and at scale, virtualizing their sensors in a way that closely mirrors real-world behavior becomes especially interesting. When accessible remotely with real-time responsiveness, this real-world behavior becomes valuable in many practical systems, notably life-saving systems such as those that issue instantaneous alerts about harmful magnetic radiation in deep mining areas. Such systems could be deployed on a large scale as virtualized smartphones on desktops or large servers, with virtualized sensors that remotely fetch real hardware sensor readings from real smartphones in real time. Based on these readings, the people working in the affected areas can be alerted, and thus their lives saved, by the operators at the desktops or large servers hosting the virtualized smartphones.Computer scienceComputer ScienceTechnical reportsApproximating the Bethe partition functionhttp://academiccommons.columbia.edu/catalog/ac:171018
Weller, Adrian; Jebara, Tonyhttp://dx.doi.org/10.7916/D8M043F6Fri, 21 Feb 2014 00:00:00 +0000When belief propagation (BP) converges, it does so to a stationary point of the Bethe free energy F, and is often strikingly accurate. However, it may converge only to a local optimum or may not converge at all. An algorithm was recently introduced for attractive binary pairwise MRFs which is guaranteed to return an ϵ-approximation to the global minimum of F in polynomial time provided the maximum degree Δ = O(log n), where n is the number of variables. Here we significantly improve this algorithm and derive several results including a new approach based on analyzing first derivatives of F, which leads to performance that is typically far superior and yields a fully polynomial-time approximation scheme (FPTAS) for attractive models without any degree restriction. Further, the method applies to general (non-attractive) models, though with no polynomial time guarantee in this case, leading to the important result that approximating the log of the Bethe partition function, log Z_B = −min F, for a general model to additive ϵ-accuracy may be reduced to a discrete MAP inference problem. We explore an application to predicting equipment failure on an urban power network and demonstrate that the Bethe approximation can perform well even when BP fails to converge.Computer scienceaw2506, tj2008Computer ScienceTechnical reportsTowards A Dynamic QoS-aware Over-The-Top Video Streaming in LTEhttp://academiccommons.columbia.edu/catalog/ac:170979
Nam, Hyunwoo; Kim, Kyung Hwa; Kim, Bong Ho; Calin, Doru; Schulzrinne, Henninghttp://dx.doi.org/10.7916/D80863BSFri, 21 Feb 2014 00:00:00 +0000We present a study of the traffic behavior of two popular over-the-top (OTT) video streaming services (YouTube and Netflix). Our analysis is conducted on different mobile devices (iOS and Android) over various wireless networks (Wi-Fi, 3G and LTE) under dynamic network conditions. Our measurements show that the video players frequently discard a large amount of video content even though it is successfully delivered to a client. We first investigate the root cause of this unwanted behavior. Then, we propose a Quality-of-Service (QoS)-aware video streaming architecture for Long Term Evolution (LTE) networks to reduce the waste of network resources and improve user experience. The architecture includes a selective packet discarding mechanism, which can be placed in packet data network gateways (P-GW). In addition, our QoS-aware rules assist video players in selecting an appropriate resolution under fluctuating channel conditions. We monitor network conditions and configure QoS parameters to control the availability of the maximum bandwidth in real time. In our experimental setup, the proposed platform shows up to 20.58% savings in downlink bandwidth and improves user experience by reducing the buffer underflow period to an average of 32 seconds.Computer sciencehn2203, kk2515, dc2686, hgs10Computer Science, Electrical EngineeringTechnical reportsHybrid Continuous-Discrete Computer: from ISA to Microarchitecturehttp://academiccommons.columbia.edu/catalog/ac:170982
Huang, Yipeng; Sethumadhavan, Simhahttp://dx.doi.org/10.7916/D8VH5KV3Fri, 21 Feb 2014 00:00:00 +0000In this project, we design an instruction set architecture for a proposed hybrid continuous-discrete computer (HCDC) chip. The ISA harnesses the microarchitectural features and analog circuitry provided in the hardware. We describe the workloads that are suitable for the HCDC architecture. The underlying microarchitecture for the HCDC chip, including its controllers, datapaths, and interfaces to analog and digital functional units are specified in detail.Computer scienceyh2315Computer ScienceTechnical reportsTowards Dynamic Network Condition-Aware Video Server Selection Algorithms over Wireless Networkshttp://academiccommons.columbia.edu/catalog/ac:170975
Nam, Hyunwoo; Kim, Kyung Hwa; Schulzrinne, Henning; Calin, Doruhttp://dx.doi.org/10.7916/D8416V3KFri, 21 Feb 2014 00:00:00 +0000We investigate video server selection algorithms in a distributed video-on-demand system. We conduct a detailed study of the YouTube Content Delivery Network (CDN) on PCs and mobile devices over Wi-Fi and 3G networks under varying network conditions. We showed that a location-aware video server selection algorithm assigns a video content server based on the network attachment point of a client. We found that such distance-based algorithms carry the risk of directing a client to a less optimal content server, although there may exist other better-performing video delivery servers. To solve this problem, we propose to use dynamic network information, such as packet loss rates and Round Trip Time (RTT) between an edge node of a wireless network (e.g., an Internet Service Provider (ISP) router in a Wi-Fi network or a Radio Network Controller (RNC) node in a 3G network) and video content servers, to find the optimal video content server when a video is requested. Our empirical study shows that the proposed architecture can provide higher TCP performance, leading to better viewing quality compared to location-based video server selection algorithms.Computer sciencehn2203, kk2515, hgs10, dc2686Computer Science, Electrical EngineeringTechnical reportsA Gameful Approach to Teaching Software Design and Software Testing - Assignments and Questshttp://academiccommons.columbia.edu/catalog/ac:170985
Sheth, Swapneel; Bell, Jonathan Schaffer; Kaiser, Gailhttp://dx.doi.org/10.7916/D8QR4V4SFri, 21 Feb 2014 00:00:00 +0000Introductory CS classes typically do not focus on software testing. Many students begin learning to program with the mental model that “if it compiles and runs without crashing, it must work fine.” Despite numerous attempts to introduce testing early in CS programs, and the many known benefits of inculcating good testing habits early in one’s programming life, student interest in software testing remains low and students remain averse to it. To address this problem, we used an internally developed research system called HALO — “Highly Addictive sociaLly Optimized Software Engineering”. Our previous work describes early prototypes of HALO; in this paper, we describe how we used it for the CS2 class and the feedback from real users. HALO uses game-like elements and motifs from popular games like World of Warcraft to make the whole software engineering process, and in particular the software testing process, more engaging and social. HALO is not a game; it leverages game mechanics and applies them to the software development process. For example, in HALO, students are given a number of “quests” that they need to complete. These quests are used to disguise standard software testing techniques like white and black box testing, unit testing, and boundary value analysis. Upon completing these quests, the students earn social rewards in the form of achievements, titles, and experience points. They can see how they are doing compared to other students in the class. While the students think that they are competing just for points and achievements, the primary benefit of such a system is that the students’ code gets tested far more thoroughly than it normally would be.Computer sciencesks2142, jsb2125, gek1Computer ScienceTechnical reportsGeometric Control of Human Stem Cell Morphology and Differentiationhttp://academiccommons.columbia.edu/catalog/ac:169751
Vunjak-Novakovic, Gordana; Wan, Leo Q.; Kang, Sylvia M.; Eng, George; Grayson, Warren L.; Lu, Xin; Huo, Bo; Gimble, Jeffrey; Guo, Xiang-Dong Edward; Mow, Van C.http://dx.doi.org/10.7916/D89P2ZM4Fri, 31 Jan 2014 00:00:00 +0000During tissue morphogenesis, stem cells and progenitor cells migrate, proliferate, and differentiate, with striking changes in cell shape, size, and acting mechanical stresses. The local cellular function depends on the spatial distribution of cytokines as well as local mechanical microenvironments in which the cells reside. In this study, we controlled the organization of human adipose derived stem cells using micro-patterning technologies, to investigate the influence of multi-cellular form on spatial distribution of cellular function at an early stage of cell differentiation. The underlying role of cytoskeletal tension was probed through drug treatment. Our results show that the cultivation of stem cells on geometric patterns resulted in pattern- and position-specific cell morphology, proliferation and differentiation. The highest cell proliferation occurred in the regions with large, spreading cells (such as the outer edge of a ring and the short edges of rectangles). In contrast, stem cell differentiation co-localized with the regions containing small, elongated cells (such as the inner edge of a ring and the regions next to the short edges of rectangles). The application of drugs that inhibit the formation of actomyosin resulted in the lack of geometrically specific differentiation patterns. This study confirms the role of substrate geometry on stem cell differentiation, through associated physical forces, and provides a simple and controllable system for studying biophysical regulation of cell function.Biomedical engineeringgv2131, gme2103, xl2402, exg1, vcm1Computer Science, Medicine, Biomedical EngineeringArticlesN Heads Are Better Than Nonehttp://academiccommons.columbia.edu/catalog/ac:167858
Hopkins, Morris; Casteneda, Mauricio; Sheth, Swapneel Kalpesh; Kaiser, Gail E.http://dx.doi.org/10.7916/D8028PFVWed, 27 Nov 2013 00:00:00 +0000Social network platforms have transformed how people communicate and share information. However, as these platforms have evolved, the ability for users to control how and with whom information is being shared introduces challenges concerning the configuration and comprehension of privacy settings. To address these concerns, our crowd sourced approach simplifies the understanding of privacy settings by using data collected from 512 users over a 17 month period to generate visualizations that allow users to compare their personal settings to an arbitrary subset of individuals of their choosing. To validate our approach we conducted an online survey with closed and open questions and collected 59 valid responses after which we conducted follow-up interviews with 10 respondents. Our results showed that 70% of respondents found visualizations using crowd sourced data useful for understanding privacy settings, and 80% preferred a crowd sourced tool for configuring their privacy settings over current privacy controls.Computer sciencemah2250, mc3683, sks2142, gek1Computer ScienceTechnical reportsHeterogeneous Access: Survey and Design Considerationshttp://academiccommons.columbia.edu/catalog/ac:167865
Addepalli, Sateesh; Schulzrinne, Henning G.; Singh, Amandeep; Ormazabal, Gastonhttp://dx.doi.org/10.7916/D8QJ7F8PWed, 27 Nov 2013 00:00:00 +0000As voice, multimedia, and data services are converging to IP, there is a need for a new networking architecture to support future innovations and applications. Users are consuming Internet services from multiple devices that have multiple network interfaces such as Wi-Fi, LTE, Bluetooth, and possibly wired LAN. Such diverse network connectivity can be used to increase both reliability and performance by running applications over multiple links, sequentially for seamless user experience, or in parallel for bandwidth and performance enhancements. The existing networking stack, however, offers almost no support for intelligently exploiting such network, device, and location diversity. In this work, we survey recently proposed protocols and architectures that enable heterogeneous networking support. Upon evaluation, we abstract common design patterns and propose a unified networking architecture that makes better use of a heterogeneous dynamic environment, both in terms of networks and devices. The architecture enables mobile nodes to make intelligent decisions about how and when to use each or a combination of networks, based on access policies. With this new architecture, we envision a shift from current applications, which support a single network, location, and device at a time to applications that can support multiple networks, multiple locations, and multiple devices.Computer sciencehgs10Computer ScienceTechnical reportsUs and Them - A Study of Privacy Requirements Across North America, Asia, and Europehttp://academiccommons.columbia.edu/catalog/ac:167855
Sheth, Swapneel Kalpesh; Kaiser, Gail E.http://dx.doi.org/10.7916/D8028PFVWed, 27 Nov 2013 00:00:00 +0000Data privacy when using online systems like Facebook and Amazon has become an increasingly popular topic in the last few years. However, little is known about how users and developers perceive privacy and which concrete measures would mitigate privacy concerns. To investigate privacy requirements, we conducted an online survey with closed and open questions and collected 408 valid responses. Our results show that users often reduce privacy to security, with data sharing and data breaches being their biggest concerns. Users are more concerned about the content of their documents and personal data such as location than their interaction data. Unlike users, developers clearly prefer technical measures like data anonymization and think that privacy laws and policies are less effective. We also observed interesting differences between people from different geographies. For example, people from Europe are more concerned about data breaches than people from North America. People from Asia/Pacific and Europe believe that content and metadata are more critical for privacy than people from North America. Our results contribute to developing a user-driven privacy framework that is based on empirical evidence in addition to the legal, technical, and commercial perspectives.Computer sciencesks2142, gek1Computer ScienceTechnical reportsFunctioning Hardware from Functional Programshttp://academiccommons.columbia.edu/catalog/ac:167862
Edwards, Stephen A.http://dx.doi.org/10.7916/D8V9860DWed, 27 Nov 2013 00:00:00 +0000To provide high performance at practical power levels, tomorrow’s chips will have to consist primarily of application-specific logic that is only powered on when needed. This paper discusses synthesizing such logic from the functional language Haskell. The proposed approach, which consists of rewriting steps that ultimately dismantle the source program into a simple dialect that enables a syntax-directed translation to hardware, enables aggressive parallelization and the synthesis of application-specific distributed memory systems. Transformations include scheduling arithmetic operations onto specific data paths, replacing recursion with iteration, and improving data locality by inlining recursive types. A compiler based on these principles is under development.Computer sciencese2007Computer ScienceTechnical reportsCorrelating Visual Speaker Gestures with Measures of Audience Engagement to Aid Video Browsinghttp://academiccommons.columbia.edu/catalog/ac:167022
Zhang, Johnhttp://hdl.handle.net/10022/AC:P:22145Thu, 07 Nov 2013 00:00:00 +0000In this thesis, we argue that in the domains of educational lectures and political debates, speaker gestures can be a source of semantic cues for video browsing. We hypothesize that certain human gestures, which can be automatically identified through techniques of computer vision, can convey significant information that is correlated with audience engagement. We present a joint-angle descriptor derived from an automatic upper body pose estimation framework to train an SVM which identifies point and spread poses in extracted video frames of an instructor giving a lecture. Ground truth is collected in the form of 2500 manually annotated frames covering 20 minutes of a video lecture. Cross validation on the ground-truth data showed classifier F-scores of 0.54 and 0.39 for point and spread poses, respectively. We also derive an attribute for gestures which measures the angular variance of the arm movements from this system (analogous to arm waving). We present a method for tracking hands which succeeds even when left and right hands are clasping and occluding each other. We evaluate on a ground-truth dataset of 698 images with 1301 annotated left and right hands, mostly clasped. Our method performs better than baseline on recall (0.66 vs. 0.53) without sacrificing precision (0.65 for both) toward the goal of recognizing clasped hands. For tracking, it results in an improvement over a baseline method with an F-score of 0.59 vs. 0.48. From this, we are able to derive hand motion-based gesture attributes such as velocity, direction change and extremal pose. In ground-truth studies, we manually annotate and analyze the gestures of two instructors, each in a 75-minute computer science lecture, using a 14-bit pose vector. We observe "pedagogical" gestures of punctuation and encouragement in addition to traditional classes of gestures such as deictic and metaphoric. We also introduce a tool to facilitate the manual annotation of gestures in video and present results on their frequencies and co-occurrences. In particular, we find that 5 poses represent 80% of the variation in the annotated ground truth. We demonstrate a correlation between the angular variance of arm movements and the presence of those conjunctions that are used to contrast connected clauses ("but", "neither", etc.) in the accompanying speech. We do this by training an AdaBoost-based binary classifier using decision trees as weak learners. On a ground-truth database of 4243 video clips totaling 3.83 hours, each with subtitles, training on sets of conjunctions indicating contrast produces classifiers capable of achieving 55% accuracy on a balanced test set. We study two different presentation methods: an attribute graph which shows a normalized measure of the visual attributes across an entire video, as well as emphasized subtitles, where individual words are emphasized (resized) based on their accompanying gestures. Results from 12 subjects show supportive ratings for the browsing aids in the task of providing keywords for video under time constraints. Subjects' keywords are also compared to independent ground truth, resulting in precisions from 0.50-0.55, even when given less than half real time to view the video. We demonstrate a correlation between gesture attributes and a rigorous method of measuring audience engagement: electroencephalography (EEG). Our 20 subjects watch 61 minutes of video of the 2012 U.S. 
Presidential Debates while under observation through EEG. After discarding corrupted recordings, we retain 47 minutes' worth of EEG data for each subject. The subjects are examined in aggregate and in subgroups according to gender and political affiliation. We find statistically significant correlations between gesture attributes (particularly extremal pose) and our feature of engagement derived from EEG. For all subjects watching all videos, we see a statistically significant correlation between gesture and engagement with a Spearman rank correlation of rho = 0.098 with p < 0.05, Bonferroni corrected. For some stratifications, correlations reach as high as rho = 0.297. From these results, we conclude that gestures can be used to measure engagement.Computer science, Communicationjrz2106Computer ScienceDissertationsUser Interfaces for Patient-Centered Communication of Health Status and Care Progresshttp://academiccommons.columbia.edu/catalog/ac:178477
Wilcox-Patterson, Laurenhttp://hdl.handle.net/10022/AC:P:22143Thu, 07 Nov 2013 00:00:00 +0000The recent trend toward patients participating in their own healthcare has opened up numerous opportunities for computing research. This dissertation focuses on how technology can foster this participation, through user interfaces to effectively communicate personal health status and care progress to hospital patients. I first characterize the design space for electronic information communication to patients through field studies conducted in multiple hospital settings. These studies utilize a combination of survey instruments, and low- and high-fidelity prototypes, including a document-editing prototype through which users can view and manage clinical data to automatically associate it with progress notes. The prototype, activeNotes, includes the first known techniques supporting clinical information requests directly within a document editor. A usage study with ICU physicians at New York-Presbyterian Hospital (NYP) substantiated our design and revealed how electronic information related to patient status and care progress is derived from a typical Electronic Health Record system. Insights gained from this study informed following studies to understand how to design abstracted, plain-language views suitable for patients. We gauged both patient and physician responses to information display prototypes deployed in patient rooms for a formative study exploring their design. Following my reports on this study, I discuss the design, development and pilot evaluations of a prototype Personal Health Record application providing live, abstracted clinical information for patients at NYP. The portal, evaluated by cardiothoracic surgery patients, is the first of its kind to allow patients to capture and monitor live data related to their care. Patient use of the portal influenced the subsequent design of tools to support users in making sense of online medication information. These tools, designed with nurses and pharmacists and evaluated by cardiothoracic surgery patients at NYP, were developed using topic modeling approaches and text analysis techniques. Embodied in a prototype called Remedy, they enable rapid filtering and comparison of medication-related search results, based on a number of website features and content topics. I conclude by discussing how findings from this series of studies can help shape the ongoing design and development of patient-centered technology.Computer science, Health care managementlgw23Computer ScienceDissertationsDesign of Scalable On-Demand Video Streaming Systems Leveraging Video Viewing Patternshttp://academiccommons.columbia.edu/catalog/ac:166939
Hwang, Kyung-Wookhttp://hdl.handle.net/10022/AC:P:22120Mon, 04 Nov 2013 00:00:00 +0000The explosive growth in on-demand access of video across all forms of delivery (Internet, traditional cable, IPTV, wireless) has renewed the interest in scalable delivery methods. Approaches using Content Delivery Networks (CDNs), Peer-to-Peer (P2P) approaches, and their combinations have been proposed as viable options to ease the load on servers and network links. However, there has been little focus on how to take advantage of user viewing patterns to understand their impact on existing mechanisms and to design new solutions that improve the streaming service quality. In this dissertation, we leverage the observation that users watch only a small portion of videos to understand the limits of existing designs and to optimize two scalable approaches -- content placement and P2P Video-on-Demand (VoD) streaming. Then, we present our novel scalable system called Joint-Family which enables adaptive bitrate streaming (ABR) in P2P VoD, supporting user viewing patterns. We first provide evidence of such user viewing behavior from data collected from a nationally deployed VoD service. In contrast to using a simplistic popularity-based placement and traditionally proposed caching strategies (such as CDNs), we use a Mixed Integer Programming formulation to model the placement problem and employ an innovative approach that scales well. We have performed detailed simulations using actual traces of user viewing sessions (including stream control operations such as pause, fast-forward, and rewind). Our results show that the use of a segment-based placement strategy yields substantial savings in both disk storage requirements at origin servers/VHOs as well as network bandwidth use. For example, compared to a simple caching scheme using full videos, our MIP-based placement using segments can achieve up to 71% reduction in peak link bandwidth usage. Secondly, we note that the policies adopted in existing P2P VoD systems have not taken user viewing behavior -- that users abandon videos -- into account. We show that abandonment can result in increased interruptions and wasted resources. As a result, we reconsider the set of policies to use in the presence of abandonment. Our goal is to balance the conflicting needs of delivering videos without interruptions while minimizing wastage. We find that an Earliest-First chunk selection policy in conjunction with the Earliest-Deadline peer selection policy allows us to achieve high download rates. We take advantage of abandonment by converting peers to "partial seeds"; this increases capacity. We minimize wastage by using a playback lookahead window. We use analysis and simulation experiments using real-world traces to show the effectiveness of our approach. Finally, we propose Joint-Family, a protocol that combines P2P and adaptive bitrate (ABR) streaming for VoD. While P2P for VoD and ABR have been proposed previously, they have not been studied together because they attempt to tackle problems with seemingly orthogonal goals. We motivate our approach through analysis that overcomes a misconception resulting from prior analytical work, and show that the popularity of a P2P swarm and seed staying time have a significant bearing on the achievable per-receiver download rate. Specifically, our analysis shows that popularity affects swarm efficiency when seeds stay "long enough". 
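The Earliest-First chunk selection policy combined with a playback lookahead window, as described above, admits a compact illustration. The following is a minimal sketch under assumed data structures (a set of downloaded chunk indices and an integer playhead); it is not the dissertation's implementation, but it shows why the policy limits both interruptions and wasted downloads.

```python
def next_chunk_request(have, playhead, total_chunks, lookahead):
    """Earliest-First chunk selection limited to a lookahead window:
    request the lowest-index missing chunk the viewer will need soon,
    so playback is rarely interrupted, while chunks far beyond the
    playhead are never fetched -- little is wasted if the viewer
    abandons the video partway through."""
    window_end = min(playhead + lookahead, total_chunks)
    for c in range(playhead, window_end):
        if c not in have:
            return c        # most urgent missing chunk in the window
    return None             # window fully buffered; defer further fetches
```

The lookahead parameter is the knob the abstract alludes to: a larger window smooths over bandwidth variation but increases the data discarded on abandonment, while a smaller window wastes little but tolerates less jitter.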
We also show that ABR in a P2P setting helps viewers achieve higher playback rates and/or fewer interruptions. We develop the Joint-Family protocol based on the observations from our analysis. Peers in Joint-Family simultaneously participate in multiple swarms to exchange chunks of different bitrates. We adopt chunk, bitrate, and peer selection policies that minimize the occurrence of interruptions while delivering high quality video and improving the efficiency of the system. Using traces from a large-scale commercial VoD service, we compare Joint-Family with existing approaches for P2P VoD and show that viewers in Joint-Family enjoy higher playback rates with minimal interruption, irrespective of video popularity.Computer science, Computer engineering, Electrical engineeringComputer Science, Electrical EngineeringDissertationsAnnotation Guidelines for Arabic Nominal Gender, Number, and Rationalityhttp://academiccommons.columbia.edu/catalog/ac:166671
Habash, Nizar Y.; Alkuhlani, Sarah M.http://hdl.handle.net/10022/AC:P:22028Tue, 29 Oct 2013 00:00:00 +0000The annotation task we define here is focused on information relevant to modeling Arabic nominal gender and number computationally. First we define the various facts regarding number and gender in Modern Standard Arabic and then we present the task guidelines and examples.Computer science, Information science, Linguisticsnh2142, sma2149Computer Science, Center for Computational Learning SystemsTechnical reportsOptimal Order and Efficiency for Iterations with Two Evaluationshttp://academiccommons.columbia.edu/catalog/ac:166436
Kung, H.T.; Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21979Thu, 10 Oct 2013 00:00:00 +0000The problem is to calculate a simple zero of a nonlinear function f. We consider rational iterations without memory which use two evaluations of f or its derivatives. It is shown that the optimal order is 2. This settles a conjecture of Kung and Traub that an iteration using n evaluations without memory is of order at most 2ⁿ⁻¹, for the case n=2. Furthermore we show that any rational two-evaluation iteration of optimal order must use either two evaluations of f or one evaluation of f and one of f'. From this result we completely settle the question of the optimal efficiency, in our efficiency measure, for any two-evaluation iteration without memory. Depending on the relative cost of evaluating f and f', the optimal efficiency is achieved by either Newton iteration or the iteration ψ.Mathematicsjft2Computer ScienceArticlesAlgorithms for Solvents of Matrix Polynomialshttp://academiccommons.columbia.edu/catalog/ac:166439
Dennis Jr., J.E.; Traub, Joseph F.; Weber, R.P.http://hdl.handle.net/10022/AC:P:21980Thu, 10 Oct 2013 00:00:00 +0000In an earlier paper we developed the algebraic theory of matrix polynomials. Here we introduce two algorithms for computing "dominant" solvents. Global convergence of the algorithms under certain conditions is established.Mathematicsjft2Computer ScienceArticlesA Three-Stage Algorithm for Real Polynomials Using Quadratic Iterationhttp://academiccommons.columbia.edu/catalog/ac:166445
Jenkins, M.A.; Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21982Thu, 10 Oct 2013 00:00:00 +0000We introduce a new three-stage process for calculating the zeros of a polynomial with real coefficients. The algorithm finds either a linear or quadratic factor, working completely in real arithmetic. In the third stage the algorithm uses one of two variable-shift iterations corresponding to the linear or quadratic case. The iteration for a linear factor is a real arithmetic version of the third stage of the algorithm for complex polynomials which we studied in an earlier paper. A new variable-shift iteration is introduced in this paper which is suitable for quadratic factors. If the complex algorithm and the new real algorithm are applied to the same real polynomial, then the real algorithm is about four times as fast. We prove that the mathematical algorithm always converges and show that the rate of convergence of the third stage is faster than second order. The problem and algorithm may be recast into matrix form. The third stage is a quadratic form of shifted inverse powering and a quadratic form of generalized Rayleigh iteration. The results of extensive testing are summarized. For an ALGOL W program run on an IBM 360/67 we found that for polynomials ranging in degree from 20 to 50, the time required to calculate all zeros averaged 2n² milliseconds. An ALGOL 60 implementation of the algorithm and a program which calculates a posteriori bounds on the zeros may be found in Jenkins’ 1969 Stanford dissertation.Mathematics, Computer sciencejft2Computer ScienceArticlesA Class of Globally Convergent Iterations for the Solution of Polynomial Equationshttp://academiccommons.columbia.edu/catalog/ac:166448
Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21983Thu, 10 Oct 2013 00:00:00 +0000We introduce a class of new iteration functions which are ratios of polynomials of the same degree and hence defined at infinity. The poles of these rational functions occur at points which cause no difficulty. The classical iteration functions are given as explicit functions of P and its derivatives. The new iteration functions are constructed according to a certain algorithm. This construction requires only simple polynomial manipulation which may be performed on a computer. We shall treat here only the important case that the zeros of P are distinct and that the dominant zero is real. The extension to multiple zeros, dominant complex zeros, and sub-dominant zeros will be given in another paper. We shall restrict ourselves to questions relevant to the calculation of zeros. Certain aspects of our investigations which are of broader interest will be reported elsewhere.Mathematics, Computer sciencejft2Computer ScienceArticlesVariational Calculations of Energy and Fine Structure for the 2³P State of Heliumhttp://academiccommons.columbia.edu/catalog/ac:166463
Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21988Thu, 10 Oct 2013 00:00:00 +0000Very accurate calculations of the energies of the 1s² ¹S and 1s2s ³S states of helium have been made recently. It is now possible to consider the contributions of relativistic and electrodynamic effects to the observed ionization energies in a meaningful way. These small effects must still be added to the large calculated nonrelativistic energy, however, before a comparison with experiment can be made. In the ³P states, the spin-dependent part of the theoretical electromagnetic interaction of the two electrons can be compared directly with observed fine-structure splittings. Recent improvements in the experimental accuracy of the fine-structure measurements, particularly the direct observation of the splittings as radio-frequency transitions, have made such a comparison possible to an order of accuracy including higher-order electrodynamic corrections to the usual fine-structure formulas.Chemistry, Molecular physicsjft2Computer ScienceArticlesConstruction of Globally Convergent Iteration Functions for the Solution of Polynomial Equationshttp://academiccommons.columbia.edu/catalog/ac:166454
Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21985Thu, 10 Oct 2013 00:00:00 +0000Iteration functions for the approximation of zeros of a polynomial P are usually given as explicit functions of P and its derivatives. We introduce a class of iteration functions which are themselves constructed according to a certain algorithm given below. The construction of the iteration functions requires only simple polynomial manipulation which may be performed on a computer.Mathematicsjft2Computer ScienceArticlesAssociated Polynomials and Uniform Methods for the Solution of Linear Problemshttp://academiccommons.columbia.edu/catalog/ac:166451
Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21984Thu, 10 Oct 2013 00:00:00 +0000To every polynomial P of degree n we associate a sequence of n-1 polynomials of increasing degree which we call the associated polynomials of P. The associated polynomials depend in a particularly simple way on the coefficients of P. These polynomials have appeared in many guises in the literature, usually related to some particular application and most often going unrecognized. They have been called Horner polynomials and Laguerre polynomials. Often what occurs is not an associated polynomial itself but a number which is an associated polynomial evaluated at a zero of P. The properties of associated polynomials have never been investigated in themselves. We shall try to demonstrate that associated polynomials provide a useful unifying concept. Although many of the results of this paper are new, we shall also present known results in our framework.Mathematicsjft2Computer ScienceArticlesGeneralized Sequences with Applications to the Discrete Calculushttp://academiccommons.columbia.edu/catalog/ac:166457
Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21986Thu, 10 Oct 2013 00:00:00 +0000Mikusinski [17] has introduced a theory of generalized functions which is algebraic in nature. Generalized functions are introduced in a way which is analogous to the extension of the concept of number from integers to rationals. In this paper, an analogous theory of "generalized sequences" is constructed for the discrete calculus. This theory serves a dual purpose. It provides a rigorous foundation for an operational calculus and provides a powerful formalism for the solution of discrete problems.Mathematicsjft2Computer ScienceArticlesVariational Calculations of the 2³S State of Heliumhttp://academiccommons.columbia.edu/catalog/ac:166466
Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21989Thu, 10 Oct 2013 00:00:00 +0000With a 12-parameter Hylleraas-type wave function containing only positive powers, a new calculation has been carried out for the 2³S state of helium by the Ritz variational principle. The energy was minimized by a descent process. A nonrelativistic energy of -1.0876088 Hylleraas units was reached as compared with the best previously published value of -1.0876015 Hylleraas units from a 6-parameter function. When mass polarization and α²Ry corrections are included, the 12-parameter function gives an ionization potential 2.52 cm⁻¹ less than the experimental value of 38 454.64 cm⁻¹. The electron density at the nucleus is also calculated and compared with the experimental hyperfine-spectrum value. All numerical work was carried out on an I.B.M. 650 computer.Chemistry, Molecular chemistryjft2Computer ScienceArticlesThe Algebraic Theory of Matrix Polynomialshttp://academiccommons.columbia.edu/catalog/ac:166433
Dennis Jr., J.E.; Traub, Joseph F.; Weber, R.P.http://hdl.handle.net/10022/AC:P:21978Thu, 10 Oct 2013 00:00:00 +0000A matrix S is a solvent of the matrix polynomial M(X) = A₀Xᵐ + ... + Aₘ if M(S) = O, where the Aᵢ, X, and S are square matrices. In this paper we develop the algebraic theory of matrix polynomials and solvents. We define division and interpolation, investigate the properties of block Vandermonde matrices, and define and study the existence of a complete set of solvents. We study the relation between the matrix polynomial problem and the lambda-matrix problem, which is to find a scalar λ for which A₀λᵐ + A₁λᵐ⁻¹ + ... + Aₘ is singular. In a future paper we extend Traub’s algorithm for calculating zeros of scalar polynomials to matrix polynomials and establish global convergence properties of this algorithm for a class of matrix polynomials.Mathematicsjft2Computer ScienceArticlesComputational Complexity of Iterative Processeshttp://academiccommons.columbia.edu/catalog/ac:166442
Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21981Thu, 10 Oct 2013 00:00:00 +0000The theory of optimal algorithmic processes is part of computational complexity. This paper deals with analytic computational complexity. The relation between the goodness of an iteration algorithm and its new function evaluation and memory requirements is analyzed. A new conjecture is stated.Computer science, Mathematicsjft2Computer ScienceArticlesOn Lagrange-Hermite Interpolationhttp://academiccommons.columbia.edu/catalog/ac:166460
Traub, Joseph F.http://hdl.handle.net/10022/AC:P:21987Thu, 10 Oct 2013 00:00:00 +0000Mathematics, Applied mathematicsjft2Computer ScienceArticlesAggregated Word Pair Features for Implicit Discourse Relation Disambiguationhttp://academiccommons.columbia.edu/catalog/ac:166308
McKeown, Kathleen; Biran, Orhttp://hdl.handle.net/10022/AC:P:21954Fri, 04 Oct 2013 00:00:00 +0000We present a reformulation of the word pair features typically used for the task of disambiguating implicit relations in the Penn Discourse Treebank. Our word pair features achieve significantly higher performance than the previous formulation when evaluated without additional features. In addition, we present results for a full system using additional features which achieves close to state of the art performance without resorting to gold syntactic parses or to context outside the relation.Computer science, Linguisticskrm8, ob2008Computer ScienceArticlesSubgroup Detection in Ideological Discussionshttp://academiccommons.columbia.edu/catalog/ac:166208
Abu-Jbara, Amjad; Dasigi, Pradeep; Diab, Mona; Radev, Dragomirhttp://hdl.handle.net/10022/AC:P:21923Thu, 03 Oct 2013 00:00:00 +0000The rapid and continuous growth of social networking sites has led to the emergence of many communities of communicating groups. Many of these groups discuss ideological and political topics. It is not uncommon that the participants in such discussions split into two or more subgroups. The members of each subgroup share the same opinion toward the discussion topic and are more likely to agree with members of the same subgroup and disagree with members from opposing subgroups. In this paper, we propose an unsupervised approach for automatically detecting discussant subgroups in online communities. We analyze the text exchanged between the participants of a discussion to identify the attitude they carry toward each other and towards the various aspects of the discussion topic. We use attitude predictions to construct an attitude vector for each discussant. We use clustering techniques to cluster these vectors and, hence, determine the subgroup membership of each participant. We compare our methods to text clustering and other baselines, and show that our method achieves promising results.Computer science, Linguisticspd2359Computer Science, Center for Computational Learning SystemsArticlesCODACT: Towards Identifying Orthographic Variants in Dialectal Arabichttp://academiccommons.columbia.edu/catalog/ac:166199
Dasigi, Pradeep; Diab, Monahttp://hdl.handle.net/10022/AC:P:21920Thu, 03 Oct 2013 00:00:00 +0000Dialectal Arabic (DA) is the spoken vernacular for over 300M people worldwide. DA is emerging as the form of Arabic written in online communication: chats, emails, blogs, etc. However, most existing NLP tools for Arabic are designed for processing Modern Standard Arabic (MSA), a variety that is more formal and scripted. Apart from the genre variation that is a hindrance for any language processing, even in English, DA has no orthographic standard, compared to MSA, which has a standard orthography and script. Accordingly, a word may be written in many possible inconsistent spellings, rendering the processing of DA very challenging. To solve this problem, such inconsistencies have to be normalized. This work is the first step towards addressing this problem, as we attempt to identify spelling variants in a given textual document. We present an unsupervised clustering approach that addresses the problem of identifying orthographic variants in DA. We employ different similarity measures that exploit string similarity and contextual semantic similarity. To our knowledge this is the first attempt at solving the problem for DA. Our approaches are tested on data in two dialects of Arabic: Egyptian and Levantine. Our system achieves an entropy of 0.19 for both Egyptian (corresponding to 68% cluster precision) and Levantine (corresponding to 64% cluster precision). This constitutes a significant reduction in entropy (from 0.47 for Egyptian and 0.51 for Levantine) and improvement in cluster precision (from 29% for both) over the baseline.Computer science, Languagepd2359Computer Science, Center for Computational Learning SystemsArticlesUnit Test Virtualization with VMVMhttp://academiccommons.columbia.edu/catalog/ac:165654
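A minimal sketch of the string-similarity half of the approach (the paper additionally exploits contextual semantic similarity): cluster surface forms by Levenshtein distance. The tokens are made-up romanizations for readability (the paper works on Arabic script), and the threshold is arbitrary:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # metric= needs sklearn >= 1.2

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[m, n]

words = ["ktyr", "kteer", "ktir", "mish", "msh"]  # illustrative romanizations
dist = np.array([[edit_distance(a, b) for b in words] for a in words])
clusters = AgglomerativeClustering(
    n_clusters=None, distance_threshold=2.5,
    metric="precomputed", linkage="average").fit_predict(dist)
print(clusters)  # spelling variants of the same word land in one cluster
```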
Bell, Jonathan Schaffer; Kaiser, Gail E.http://hdl.handle.net/10022/AC:P:21764Mon, 23 Sep 2013 00:00:00 +0000Testing large software packages can become very time-intensive. To address this problem, researchers have investigated techniques such as Test Suite Minimization. Test Suite Minimization reduces the number of tests in a suite by removing tests that appear redundant, at the risk of reducing fault-finding ability since it can be difficult to identify which tests are truly redundant. We take a completely different approach to solving the same problem of long-running test suites by instead reducing the time needed to execute each test, an approach that we call Unit Test Virtualization. We describe the empirical analysis that we performed to ground our approach and provide an implementation of Unit Test Virtualization targeting Java applications. We evaluated our implementation, VMVM, using 20 real-world Java applications and found that it reduces test suite execution time by up to 97 percent (on average, 62 percent) when compared to traditional unit test execution. We also compared VMVM to a well-known Test Suite Minimization technique, finding the reduction provided by VMVM to be four times greater, while still executing every test with no loss of fault-finding ability.Computer sciencejsb2125, gek1Computer ScienceTechnical reportsOn Effectiveness of Traffic Analysis Against Anonymity Networks Using Netflowhttp://academiccommons.columbia.edu/catalog/ac:165651
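A back-of-the-envelope sketch, with made-up numbers, of why the two strategies attack the same quantity: total suite time is roughly n_tests × (per-test overhead + test body time). Minimization shrinks the first factor at some risk to fault finding, while Unit Test Virtualization shrinks the overhead term and keeps every test:

```python
# Illustrative numbers only; none of these figures come from the paper.
n_tests, overhead, body = 1000, 2.0, 0.1   # seconds per test
baseline    = n_tests * (overhead + body)
minimized   = int(n_tests * 0.7) * (overhead + body)   # drop 30% of tests
virtualized = n_tests * (0.05 + body)                  # shrink per-test overhead
print(baseline, minimized, virtualized)  # 2100.0 1470.0 150.0
# Cutting overhead can dominate cutting test count, with nothing removed.
```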
Chakravarty, Sambuddho; Polychronakis, Michalis; Portokalidis, Georgios; Barbera, Marco V.; Keromytis, Angelos D.http://hdl.handle.net/10022/AC:P:21763Mon, 23 Sep 2013 00:00:00 +0000Low-latency anonymity-preserving networks, such as Tor, are geared towards preserving the anonymity of users of semi-interactive Internet applications such as web browsing and instant messaging. In an attempt to maintain users' quality of service, such systems preserve packet inter-arrival characteristics (such as inter-packet delay). Thus, an adversary with access to traffic patterns at various points of the Tor network can observe similarities in these patterns and discover a relationship between otherwise apparently unrelated network connections. Such attacks are commonly known as traffic analysis attacks. In the past, various traffic analysis attacks against Tor have been explored. Most modern networking equipment has traffic monitoring subsystems built in, e.g. Cisco's NetFlow. An adversary could potentially utilize the network statistics derived from such subsystems to launch traffic analysis attacks. In the paper "Sampled Traffic Analysis by Internet-Exchange-Level Adversaries," Murdoch and Zielinski presented two novel contributions: 1) a case study showing that a very small number of Internet Exchanges (IXes) intercept, and could thus monitor, a significant fraction of network paths from Tor nodes to various popular Internet destinations, and 2) a mathematical model to classify and de-anonymize anonymous traffic. Our research complements their efforts. We focus on the possible "next step" of the problem, viz. evaluating the feasibility and effectiveness of practical traffic analysis, using NetFlow data, to determine the source of anonymous traffic. We present an active traffic analysis method that deliberately perturbs the characteristics of traffic entering the Tor network and observes a similar perturbation in the traffic leaving the network. Our method relies on statistical correlation to detect such perturbations. We evaluate the accuracy of our method both in a controlled lab environment and using data gathered from a public Tor relay serving several hundred Tor users. In the in-lab tests, we achieved 100 percent accuracy in identifying the source of anonymous traffic. In the tests involving data from the public Tor relay, we achieved an overall accuracy of about 80 percent.Computer sciencesc2516, mp3018, ak2052Computer ScienceTechnical reportsLDC Arabic Treebanks and Associated Corpora: Data Divisions Manualhttp://academiccommons.columbia.edu/catalog/ac:165632
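The perturb-and-correlate idea above lends itself to a tiny simulation. A minimal sketch, assuming per-interval byte counts of the kind coarse NetFlow records can yield; the flows, noise levels, and variable names are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
pattern = np.tile([5000, 500], 30).astype(float)        # injected on/off perturbation
candidate = pattern + rng.normal(0, 400, pattern.size)  # same flow plus network noise
unrelated = rng.normal(2750, 1200, pattern.size)        # some other user's flow

def corr(a, b):
    """Pearson correlation between two per-interval byte-count series."""
    return np.corrcoef(a, b)[0, 1]

print(corr(pattern, candidate))  # near 1: likely carries the perturbed traffic
print(corr(pattern, unrelated))  # near 0: unrelated traffic
```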
Diab, Mona; Habash, Nizar Y.; Rambow, Owen; Roth, Ryan M.http://hdl.handle.net/10022/AC:P:21761Mon, 23 Sep 2013 00:00:00 +0000The Linguistic Data Consortium (LDC) has developed hundreds of data corpora for natural language processing (NLP) research. Among these are a number of annotated treebank corpora for Arabic. Typically, these corpora consist of a single collection of annotated documents. NLP research, however, usually requires multiple data sets for the purposes of training models, developing techniques, and final evaluation. It therefore becomes necessary to divide each corpus into the required data sets (divisions). Unfortunately, there is no universally accepted convention or standard for dividing bulk corpora. This has caused different research groups either to define their own divisions (which makes comparison with similar research results difficult) or to adopt existing published divisions (which do not adapt as new corpus versions are released). When a new treebank is released, a new division needs to be developed, which may or may not be consistent with the other treebank divisions. This document details a set of rules that have been defined to enable consistent divisions for old and new Arabic treebanks (ATB) and related corpora. These rules have been applied to the currently available LDC Modern Standard Arabic Treebanks (ATB1 - ATB12), the Egyptian Arabic Treebanks (ARZ1 - ARZ8) and the spoken Levantine ATB, and the exact divisions are listed in tables.Computer science, Information sciencenh2142, ocr2101, rmr48Computer Science, Center for Computational Learning SystemsTechnical reports
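As a purely hypothetical illustration (these are not the LDC rules, which are specified in the document's tables), a deterministic, document-keyed split shows why rule-based divisions stay stable as new corpus versions add documents: each document's assignment depends only on its own identifier, never on the rest of the collection:

```python
import hashlib

def split_for(doc_id, dev_pct=10, test_pct=10):
    """Assign a document to train/dev/test from a hash of its own ID
    (hypothetical rule; percentages are illustrative)."""
    h = int(hashlib.sha1(doc_id.encode()).hexdigest(), 16) % 100
    if h < dev_pct:
        return "dev"
    if h < dev_pct + test_pct:
        return "test"
    return "train"

print(split_for("ANN20001015.0001"))  # same answer in every corpus release
```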