Dataiku Q & A - Recent questions tagged files
https://answers.dataiku.com/tag/files
Powered by Question2Answer

Server filesystem using virtual machine
https://answers.dataiku.com/3451/server-filesystem-using-virtual-machine
I have a big CSV dataset (43 GB) and I can't upload it due to the size limit. I am on Windows 7, so I use a virtual machine to work with Dataiku.

Someone advised me to push the file directly to the virtual machine and use the server filesystem option, but how can I do that? I searched for tutorials but couldn't find anything.
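A rough sketch of getting the file onto the VM (the paths, hostname, and username below are hypothetical; scp from the host, or a VirtualBox shared folder mounted inside the VM, both work):

```shell
# 1) From the Windows host, copy the CSV into the VM, e.g. with scp
#    (hypothetical host/user/paths, adjust to your setup):
#      scp bigdata.csv dataiku@vm-host:/data/dss_uploads/
#    or drop it in a VirtualBox shared folder mounted inside the VM.
# 2) Inside the VM, verify the file landed where DSS can read it
#    (a tiny stand-in file is created here so the check is runnable):
mkdir -p /tmp/dss_uploads
echo "a,b,c" > /tmp/dss_uploads/bigdata.csv   # stand-in for the real 43 GB file
ls -lh /tmp/dss_uploads/bigdata.csv
# 3) In DSS, create a new Filesystem dataset and point it at that directory,
#    instead of uploading through the browser.
```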
I really need help on this one. Thanks in advance.
https://answers.dataiku.com/3451/server-filesystem-using-virtual-machine
Sat, 04 Aug 2018 18:47:35 +0000

Push a file to the server filesystem
https://answers.dataiku.com/3433/push-a-file-to-the-server-filesystem
I tried to put a file in the server filesystem, but I can't find it when I try to import the dataset.

Are there any tutorials or guidelines for doing this correctly?
https://answers.dataiku.com/3433/push-a-file-to-the-server-filesystem
Thu, 02 Aug 2018 18:11:46 +0000

Issue uploading file
https://answers.dataiku.com/3417/issuing-uploading-file
I tried to upload a CSV file (43 GB). The upload stopped after a few seconds, restarted from the beginning a couple of times, and then the uploading file disappeared.

I use VirtualBox on Windows 7, so I uninstalled and reinstalled VirtualBox and DSS on another hard drive (with 123 GB of free space). Now it uploads the file up to about 1/3 of its size and then stops.

Is there any way I can upload the file?

PS: Concerning the prerequisites, my computer has 8 GB of RAM, a 64-bit OS, and the latest version of VirtualBox.
https://answers.dataiku.com/3417/issuing-uploading-file
Wed, 01 Aug 2018 19:35:44 +0000

Output tab file to managed folder in DSS
https://answers.dataiku.com/2314/output-tab-file-to-managed-folder-in-dss
Hi,

I have a dataframe in my DSS workflow which I want to modify and store in a non-CSV file within a folder.

Assume my dataframe is called df; for the example you can recreate it as follows:

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10], "c": [11, 12, 13, 14, 15]})

I now want to add a few lines of comment above the dataframe and then save the file automatically in a folder.

First, I loaded my dataset into a folder ("my_input_folder") with the DSS recipe "Export to folder", calling the file df.csv. Then I added a Python script which reads the file, adds the comments, and outputs it in another folder ("my_output_folder"). The code is below, but it didn't do what I wanted. Could you please help?
```python
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import os.path

# Recipe inputs
folder_path = dataiku.Folder("my_input_folder").get_path()
path_of_csv = os.path.join(folder_path, "df.csv")

# Recipe outputs
output2 = dataiku.Folder("my_output_folder")
output2_path = output2.get_path()

completeName = os.path.join(folder_path, "df.csv")

file1 = open(completeName, "w")
toFile = raw_input("# This is my first comment\n This is my other comment \n")  # I need to write two comments on two different rows
file1.write(toFile)
file1.close()

dirPath2 = os.path.join(output2_path, file1)
```
Thank you!
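For reference, a minimal sketch of what writing the comments plus the dataframe into one file could look like. In DSS the folder path would come from `dataiku.Folder("my_output_folder").get_path()`; a temp directory stands in for it here so the sketch is self-contained, and note that `raw_input()` prompts the user rather than writing text, so fixed comments are written directly instead:

```python
import os
import tempfile

import pandas as pd

# Stand-in for dataiku.Folder("my_output_folder").get_path() inside DSS.
output_path = tempfile.mkdtemp()

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10], "c": [11, 12, 13, 14, 15]})

out_file = os.path.join(output_path, "df.csv")
with open(out_file, "w") as f:
    # Write the two comment lines first (not via raw_input, which only
    # prompts the user for input)...
    f.write("# This is my first comment\n")
    f.write("# This is my other comment\n")
    # ...then append the dataframe below them; to_csv accepts an open
    # file handle, so everything lands in a single file.
    df.to_csv(f, index=False)

print(open(out_file).read().splitlines()[0])  # -> "# This is my first comment"
```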
https://answers.dataiku.com/2314/output-tab-file-to-managed-folder-in-dss
Fri, 12 Jan 2018 11:51:38 +0000

How to ingest files via browser upload?
https://answers.dataiku.com/2269/how-to-ingest-files-via-browser-upload
Can you use the API (or another method) to upload a file via the browser and provide DI workflow status via the API response?
https://answers.dataiku.com/2269/how-to-ingest-files-via-browser-upload
Thu, 21 Dec 2017 00:50:52 +0000

Using a custom Python module in a recipe
https://answers.dataiku.com/1821/using-custom-module-python-in-recipe
Hi,

Can I import a custom Python module in a recipe?

If so, how can I use it?
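A minimal sketch of one way this can work, with a hypothetical module `my_helpers` created on the fly so the example is self-contained (DSS also has project libraries whose directory is on the Python path for recipes, which avoids touching `sys.path` by hand):

```python
import os
import sys
import tempfile

# Create a small custom module on disk (stand-in for your real module file).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_helpers.py"), "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

# Make the directory importable, then import the module as usual
# from the recipe code.
sys.path.append(module_dir)
import my_helpers

print(my_helpers.double(21))  # -> 42
```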
Thank you for the reply.
https://answers.dataiku.com/1821/using-custom-module-python-in-recipe
Mon, 18 Sep 2017 09:15:27 +0000

Argument List too Long error which is independent of the recipe
https://answers.dataiku.com/1638/argument-list-too-long-error-which-is-independent-the-recipe
I have a Filesystem datasource which contains thousands of folders, and each folder contains a list of comma-separated files. Each file in each directory has a different schema, and the file name is used to create partitioned data sources with the following format:

/%{DIR_NAME}/KEY_%{DIR_NAME}.csv

This creates a datasource based on all the files that start with KEY in their name. That part is working as expected. My problem is that I can't run any recipe against that data source. I tried Python, shell, and sync recipes, and all of them failed with the same error:
```
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at com.dataiku.dip.security.process.RegularProcess.start(RegularProcess.java:47)
at com.dataiku.dip.security.process.InsecureProcessesLaunchService.launch(InsecureProcessesLaunchService.java:34)
at com.dataiku.dip.dataflow.exec.AbstractCodeBasedActivityRunner.execute(AbstractCodeBasedActivityRunner.java:263)
at com.dataiku.dip.dataflow.exec.AbstractCodeBasedActivityRunner.execute(AbstractCodeBasedActivityRunner.java:231)
at com.dataiku.dip.dataflow.exec.AbstractPythonRecipeRunner.executeScript(AbstractPythonRecipeRunner.java:37)
at com.dataiku.dip.recipes.code.python.PythonRecipeRunner.run(PythonRecipeRunner.java:49)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:353)
Caused by: java.io.IOException: error=7, Argument list too long
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
```
My current recipe is in Python and the code is:
```python
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Recipe inputs
print("Here")

events_CSV = dataiku.Dataset("KEY_CSV")
events_CSV_df = events_CSV.get_dataframe()

# Recipe outputs
events_ORC = dataiku.Dataset("KEY_ORC")
events_ORC.write_with_schema(events_CSV_df)
```
The job fails before printing "Here".

These are the DSS instance settings:
```
{u'dipInstanceId': u'8bu1n1os-203c299d56c99ef078a53a1a81b6ea23-c60f6bab8e57ecd615a8ec240207f819', u'features': {u'TWITTER': {}, u'HADOOP': {}, u'HIVE': {}, u'PIG': {}, u'R': {}, u'SPARK': {}}, u'devInstance': False, u'distribVersion': u'7.3', u'debug': False, u'version': {u'product_commitid': u'', u'conf_version': u'16', u'product_version': u'4.0.5'}, u'distrib': u'redhat'}
```
https://answers.dataiku.com/1638/argument-list-too-long-error-which-is-independent-the-recipe
Wed, 26 Jul 2017 21:39:53 +0000

How to apply custom Python code to multiple csv files in a folder?
https://answers.dataiku.com/1520/how-apply-custom-python-code-to-multiple-csv-files-in-folder
I want to do the above in the free version of DSS (v4.0.5). I created a Filesystem dataset and pointed it at the folder containing my CSV input files. All the CSV files have the same schema. However, when I create the dataset it only appears to 'see' one of the CSV files, so when I run my flow it only processes one of them. But I want to process all the files in order (e.g., by alphabetical order of the input file names), feeding the data from each file into my custom code one file's worth at a time.
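Processing every CSV in a folder in sorted filename order can be sketched with a plain loop over the directory (the folder and files below are hypothetical, created on the fly so the sketch runs):

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical folder of same-schema CSVs, created here as a stand-in;
# files are written out of alphabetical order on purpose.
folder = tempfile.mkdtemp()
for name, start in [("b.csv", 3), ("a.csv", 1)]:
    pd.DataFrame({"x": [start, start + 1]}).to_csv(os.path.join(folder, name), index=False)

# Read the files one at a time, in alphabetical order of file name.
frames = []
for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
    df = pd.read_csv(path)
    # ... custom per-file code would go here ...
    frames.append(df)

# Stack the per-file results back into a single dataframe.
result = pd.concat(frames, ignore_index=True)
print(result["x"].tolist())  # -> [1, 2, 3, 4]
```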
Is there any way I can do this without having to write my custom code so that it opens the folder and processes the files in a loop? (E.g., a bit like at https://answers.dataiku.com/1347/read-csvs-from-a-folder)
https://answers.dataiku.com/1520/how-apply-custom-python-code-to-multiple-csv-files-in-folder
Thu, 29 Jun 2017 15:33:54 +0000

How to apply a workflow to several files
https://answers.dataiku.com/1290/how-to-apply-a-workflow-to-several-files
[Workflow screenshot: https://answers.dataiku.com/?qa=blob&qa_blobid=16694530549654806401]
Please see the workflow above. I would like to apply the lower part of the workflow, currently applied to only one file, to all files in a directory, and then concatenate all the results. How can I do this with Dataiku?
https://answers.dataiku.com/1290/how-to-apply-a-workflow-to-several-files
Tue, 25 Apr 2017 10:55:01 +0000