Writing Large CSV Files

Jun 15, 2016 at 4:32 PM
Hi guys

I quite regularly run queries against large PowerBI datasets (300+ million rows) and I notice that DAX Studio doesn't handle extracting data to CSV very well.

It appears to cache the entire data set in memory (blowing out the memory on my 16GB laptop) prior to writing to disk. Is it possible to fix this?

Further, large queries sometimes fail with the following error: "Specified cast is not valid". The way around this appears to be converting the table's columns to text in PBI (during ETL in PQ).


Thanks,
Simon
Coordinator
Jun 16, 2016 at 1:09 AM
I would have thought it would be more efficient to extract data from the original source system. Is there something blocking you from running your extracts there?

You are right that it currently caches the entire data set in memory. This is what we have to do in order to bind the data to the grid output. The CSV extract is currently just built off the back of the same query engine: it loops through the cached data set and writes it out to disk. It might be possible for us to stream the data through a DataReader rather than caching it in a DataSet, but the query engine runs asynchronously (so that you can have multiple tabs each running different queries), so it would be a non-trivial change. We have been thinking about it though, as it may also allow us to cancel queries part way through but still show a partial result set.
Further, large queries sometimes receive the following error, "Specified cast is not valid".
This is not something we've seen before. It may be something specific to a given data type, but my guess would be that it's probably related to some specific value in your data. I'll see if it's possible to get some extra information into the error message that will help us diagnose this.
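DAX Studio itself is a .NET application, so the real change would sit around ADOMD.NET, but as a rough, editor-added illustration of the streaming-versus-caching distinction described above, here is a minimal Python sketch. The function name and the driver-agnostic cursor are assumptions for illustration only, not anything from DAX Studio's code.

```python
import csv

def stream_query_to_csv(cursor, out_path):
    """Write a query result to CSV one row at a time.

    `cursor` is any DB-API cursor on which a query has already been
    executed (via whatever driver connects to your source). Iterating
    the cursor streams rows as the provider returns them -- the
    DataReader-style approach -- instead of calling fetchall(), which
    would hold the whole result set in memory much like a DataSet does.
    """
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Column names come from the cursor description.
        writer.writerow(col[0] for col in cursor.description)
        for row in cursor:  # one row at a time, roughly constant memory
            writer.writerow(row)
```

The point of the sketch is only the shape of the loop: rows go straight from the reader to the file, so memory use stays flat no matter how many rows the query returns.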
Jun 16, 2016 at 1:28 AM
Thanks for the quick reply dgosbell!

I'd also like to thank you for designing such an amazing product btw. I use it very regularly.

I think there is value in allowing users to extract large data sets from PBI easily. This allows your average joe to perform one-off ETLs leveraging Power Query instead of other ETL software like SSIS. In fact I believe CSV extraction is a fairly highly voted request on the PBI feedback website.

Just a heads up - my current approach is to set up a linked server in SQL Server pointing at the PBI Tabular instance and then run a T-SQL command to extract the data to CSV - it's very reliable, but my goodness is it slow!
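As a rough sketch of that workaround (an editor-added illustration, not the poster's actual script): it assumes a linked server named PBI_TABULAR has already been configured against the PBI Tabular instance and that the model has a table called Sales, and it uses pyodbc just to run the pass-through query and stream the rows out to CSV.

```python
import csv
import pyodbc

# Assumed names: a linked server called PBI_TABULAR pointing at the PBI
# Tabular instance, and a model table called Sales. OPENQUERY passes the
# DAX straight through to the Tabular engine.
SQL = "SELECT * FROM OPENQUERY([PBI_TABULAR], 'EVALUATE Sales')"

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;"
)

with open("sales_extract.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    cursor = conn.cursor()
    cursor.execute(SQL)
    writer.writerow(col[0] for col in cursor.description)
    for row in cursor:  # stream rows rather than fetching them all
        writer.writerow(row)
```

The double hop (Tabular to SQL Server to client) is presumably a big part of why this route is so much slower than a direct export would be.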
Sep 23, 2016 at 4:00 PM
Just want to bump this. Has there been any more thought on leveraging a DataReader when export to CSV is selected?

This would be super useful :)
Coordinator
Sep 24, 2016 at 4:30 AM
I have started having a look at this and I think I have a design for how to fit a data reader into the architecture. I also think it will actually allow us to implement a couple of other interesting features. We are currently stabilizing the next release so we will most likely look at this for the release after that.
Sep 24, 2016 at 9:45 AM
Thanks mate! This is great news!

Just a heads up, I haven't figured out what causes the "Specified cast is not valid" error, but I'm still able to replicate the bug on new installs/versions of PBI and DAX Studio on different laptops and completely different data models. If there's a way I could run some kind of error dump then I'd be happy to give you the export. Unfortunately I can't share the data models due to client confidentiality.