By Peter Styliadis on SAS Users | Programming Tips
Welcome to the continuation of my series Getting Started with Python Integration to SAS Viya. In this post I'll discuss how to load multiple CSV files into memory as a single table using the loadTable action.
Load and prepare data on the CAS server
To start, we need multiple CSV files in a folder. I created the following script to load and prepare the CSV files on the CAS server for this post; it should work in your environment. The script does the following:
- Loads the WARRANTY_CLAIMS_0117.sashdat file from the Samples caslib into memory.
- Modifies the in-memory table by renaming and dropping columns.
- Adds a subdirectory in the Casuser caslib named csv_file_blogs.
- Saves a CSV file for each distinct year in the csv_file_blogs folder.
- Drops the original CAS table.
import swat
import pandas as pd

## Connect to CAS
conn = swat.CAS(Enter your CAS connection information)

##
## Data prep functions
##

def prep_data():
    """
    Load and prepare the WARRANTY_CLAIMS_0117.sashdat file in CAS
    """
    ## Load the WARRANTY_CLAIMS_0117.sashdat from the Samples caslib into memory in Casuser
    conn.loadTable(path='WARRANTY_CLAIMS_0117.sashdat', caslib='samples',
                   casout={'name':'warranty_claims',
                           'caslib':'casuser',
                           'replace':True})

    ##
    ## DATA PREP
    ##

    ## Reference the CAS table in an object
    castbl = conn.CASTable('warranty_claims', caslib='casuser')

    ## Store the column names and labels in a dataframe
    df_col_names = castbl.columnInfo()['ColumnInfo'].loc[:,['Column','Label']]

    ## Create a list of dictionaries of how to rename each column using the column labels
    renameColumns = []
    for row in df_col_names.iterrows():
        colName = row[1].values[0]
        labelName = row[1].values[1].replace(' ','_')
        renameColumns.append(dict(name=colName, rename=labelName))

    ## List of columns to keep in the CAS table
    keepColumns = {'Campaign_Type', 'Platform', 'Trim_Level', 'Make', 'Model_Year',
                   'Engine_Model', 'Vehicle_Assembly_Plant', 'Claim_Repair_Start_Date',
                   'Claim_Repair_End_Date'}

    ## Rename and drop columns to make the table easier to use
    castbl.alterTable(columns=renameColumns, keep=keepColumns)

    ## Return the CASTable object reference
    return castbl


def save_cas_table_as_csv_files(cas_table_reference):
    """
    Create a subdirectory in Casuser and save multiple CSV files in it.
    """
    ## Create a subdirectory in the Casuser caslib named csv_file_blogs
    conn.addCaslibSubdir(name='casuser', path='csv_file_blogs')

    ## Create a CSV file for each year
    for year in list(cas_table_reference.Model_Year.unique()):
        (cas_table_reference
         .query(f"Model_Year = '{year}'")
         .save(name=f'csv_file_blogs/warranty_claims_{year}.csv',
               caslib='casuser',
               replace=True))

    ## Drop the CAS table
    cas_table_reference.dropTable()

    ## View files in the csv_file_blogs subdirectory
    fi = conn.fileInfo(allFiles=True, caslib='casuser')
    fi_subdir = conn.fileInfo(path='csv_file_blogs', caslib='casuser')
    display(fi, fi_subdir)


## Create the CAS table
castbl = prep_data()

## Save the CAS table as a CSV file for each year
save_cas_table_as_csv_files(castbl)

## and the results
NOTE: Cloud Analytic Services made the file WARRANTY_CLAIMS_0117.sashdat available as table WARRANTY_CLAIMS in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2015.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2016.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2017.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2018.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services saved the file csv_file_blogs/warranty_claims_2019.csv in caslib CASUSER(Peter).
NOTE: Cloud Analytic Services dropped table warranty_claims from caslib CASUSER(Peter.Styliadis).
The results show that five CSV files named warranty_claims_<year>.csv were created in the csv_file_blogs subdirectory of the Casuser caslib.
Next, I'll use the fileInfo action to view the new csv_file_blogs subdirectory in the Casuser caslib. In the fileInfo CAS action, use the includeDirectories parameter to view subdirectories.
conn.fileInfo(includeDirectories = True, caslib = 'casuser')
Lastly, I'll view the available files in the csv_file_blogs subdirectory. To view files in a subdirectory of a caslib, specify the folder name in the path parameter.
conn.fileInfo(path = 'csv_file_blogs', caslib = 'casuser')
The results show that the subdirectory has five CSV files.
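As an aside, if you want to prototype the same per-year split locally without a CAS server, a rough pandas equivalent of the save_cas_table_as_csv_files logic might look like this. The tiny DataFrame, the temporary folder, and the row counts are made up purely for illustration:

```python
import os
import tempfile

import pandas as pd

# Hypothetical local stand-in for the warranty_claims CAS table
df = pd.DataFrame({
    'Model_Year': [2015, 2016, 2016, 2017, 2017, 2017],
    'Make': ['A', 'B', 'A', 'C', 'B', 'A'],
})

# Local stand-in for the csv_file_blogs subdirectory
out_dir = tempfile.mkdtemp()

# Write one CSV file per distinct Model_Year, mirroring the
# query-then-save loop in save_cas_table_as_csv_files
for year, part in df.groupby('Model_Year'):
    part.to_csv(os.path.join(out_dir, f'warranty_claims_{year}.csv'), index=False)

print(sorted(os.listdir(out_dir)))
# → ['warranty_claims_2015.csv', 'warranty_claims_2016.csv', 'warranty_claims_2017.csv']
```

The same pattern scales to however many distinct years the real table contains; groupby handles the splitting that the CAS query does server-side.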
Load all of the CSV files as a single table
To easily load multiple CSV files as a single in-memory CAS table, use the table.loadTable CAS action. The only requirements when loading multiple CSV files are:
- All of the CSV files must end with .csv.
- Set the multiFile option to True in the importOptions parameter.
- The CSV files must have the same number of columns and the columns must have the same data type.
Here, I'll use the loadTable action with the path parameter to specify the subdirectory csv_file_blogs and the caslib parameter to specify the Casuser caslib. The importOptions parameter uses the fileType option to determine the other parameters that apply. The CSV file type has a multiFile option. If you set the multiFile option to True, it will try to load all CSV files from the path into a single in-memory table. The casOut parameter simply specifies output CAS table information.
conn.loadTable(path="csv_file_blogs", caslib = 'casuser',  ## Specify the subdirectory name (csv_file_blogs) and the input caslib name
               importOptions = {                           ## Specify the import options
                   'fileType' : 'CSV',
                   'multiFile' : True
               },
               casOut = {                                  ## Specify output CAS table information
                   'name' : 'allCSVFiles',
                   'caslib' : 'casuser',
                   'replace' : True
               })

## and the results
NOTE: The file, '/cas/data/caslibs/casuserlibraries/peter/csv_file_blogs/warranty_claims_2015.csv' was used to create the CAS Table column names.
NOTE: The CSV file table load for table, 'allCSVFiles' produced 153217 rows from 5 files.
NOTE: Cloud Analytic Services made the file csv_file_blogs available as table ALLCSVFILES in caslib CASUSER(Peter).
The action concatenated the CSV files and loaded them into memory as a single distributed CAS table named ALLCSVFILES.
Next, I'll run the tableInfo action to view available in-memory tables in the Casuser caslib.
conn.tableInfo(caslib = 'casuser')
The action results show one CAS table is in memory.
Then I'll make a client-side reference to the distributed CAS table and run the head method from the SWAT package.
allcsvfilesTbl = conn.CASTable('allcsvfiles', caslib = 'casuser')
allcsvfilesTbl.head()
Finally, I'll run the SWAT value_counts method on the Model_Year column.
(allcsvfilesTbl        ## CAS table
 .Model_Year           ## CAS column
 .value_counts()       ## SWAT value_counts method
)

## and the results
2017.0    70479
2018.0    43975
2016.0    32707
2019.0     3510
2015.0     2546
dtype: int64
The results show that all five years of data, one year per CSV file, were imported into the CAS table.
Add file name and path columns to the CAS table
The importOptions parameter has a variety of CSV options you can use to modify how to import the files. Two useful options are showFile and showPath. The showFile option includes a column that shows the CSV file name. The showPath option includes a column that shows the fully-qualified path to the CSV file.
conn.loadTable(path="csv_file_blogs", caslib = 'casuser',  ## Specify the subdirectory name (csv_file_blogs) and the input caslib name
               importOptions = {                           ## Specify the import options
                   'fileType' : 'CSV',
                   'multiFile' : True,
                   'showFile' : True,
                   'showPath' : True
               },
               casOut = {                                  ## Specify output CAS table information
                   'name' : 'allCSVFiles_path_info',
                   'caslib' : 'casuser',
                   'replace' : True
               })

## and the results
NOTE: The file, '/cas/data/caslibs/casuserlibraries/peter/csv_file_blogs/warranty_claims_2015.csv' was used to create the CAS Table column names.
NOTE: The CSV file table load for table, 'allCSVFiles_path_info' produced 153217 rows from 5 files.
NOTE: Cloud Analytic Services made the file csv_file_blogs available as table ALLCSVFILES_PATH_INFO in caslib CASUSER(Peter).
I'll run the tableInfo action to view available CAS tables.
conn.tableInfo(caslib = 'casuser')
The action shows that two CAS tables are now in memory.
I'll reference and preview the new CAS table.
allcsvfiles_path_infoTbl = conn.CASTable('allcsvfiles_path_info', caslib = 'casuser')
allcsvfiles_path_infoTbl.head()
The preview shows the new CAS table has a column named path and a column named fileName.
Lastly, I'll use the SWAT value_counts method to view how many rows came from each CSV file. I'll specify the CAS table reference, the column name, then the value_counts method.
(allcsvfiles_path_infoTbl
 .fileName
 .value_counts())

## and the results
warranty_claims_2017.csv    70479
warranty_claims_2018.csv    43975
warranty_claims_2016.csv    32707
warranty_claims_2019.csv     3510
warranty_claims_2015.csv     2546
dtype: int64
The results show the CSV files were concatenated into a single CAS table. We can see how many rows came from each file.
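For comparison, the same multi-file concatenation, including a source-file column like the one the showFile option adds, can be sketched client-side with plain pandas. The folder contents and row counts below are made up for illustration, not taken from the warranty data:

```python
import os
import tempfile
from glob import glob

import pandas as pd

# Build a hypothetical folder of per-year CSV files
data_dir = tempfile.mkdtemp()
for year, rows in {2015: 2, 2016: 3}.items():
    pd.DataFrame({'Model_Year': [year] * rows}).to_csv(
        os.path.join(data_dir, f'warranty_claims_{year}.csv'), index=False)

# Read every CSV file in the folder and tag each row with its source
# file name, similar to loadTable with multiFile=True and showFile=True
all_files = sorted(glob(os.path.join(data_dir, '*.csv')))
combined = pd.concat(
    (pd.read_csv(f).assign(fileName=os.path.basename(f)) for f in all_files),
    ignore_index=True)

print(combined['fileName'].value_counts().sort_index())
```

The key difference is scale: pandas concatenates on the client in local memory, while loadTable distributes the combined table across the CAS server, so the CAS route is the one to reach for with large data.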
Summary
The SWAT package blends the world of pandas and CAS to process your distributed data. In this example I focused on using the table.loadTable CAS action to concatenate multiple CSV files into a single distributed CAS table with a single action call.
Additional and related resources
- Getting Started with Python Integration to SAS® Viya® - Index
- SWAT API Reference
- table.loadTable CAS action
- SAS® Cloud Analytic Services: Fundamentals
- CAS Action Documentation
- CAS Action! - a series on fundamentals
- SAS Course - SAS® Viya® and Python Integration Fundamentals