Guess what the most common format is for publishing open data by cities around the world?

Citadel on the Move is a €4 million EU research project that tracks open data publication and reuse practices in 140 cities around the world. They identified 77 different formats, the most common being CSV, followed by XML and XLS.

Interestingly, in the early days of open data, Citadel themselves recommended the complex LOD (Linked Open Data) format. They later switched to XML, and today CSV is their recommendation of choice.

Also interesting: Israel is not taking part in the study. Anyone want to get their municipality on board?..

I’ve mentioned it before here in a different context, but data as CSV is so much more important than APIs, and it is great to see that this is the result of their research.

Many organizations are working on tooling around CSV for open data, including OK and most of the work I do there.

From what I’ve seen in the UK, CSV as a target format also ties in extremely well with developing a culture around open data - the people who are producing and processing reports are using CSV and tooling right at the point of data production, and therefore they are aware of transparency, accessibility and other reasons for publishing data in certain formats.

API-first and other non-human-readable approaches do not gain this benefit - they position the issue as a technical one rather than a socio-cultural one.

If this table were compiled for the State of Israel, there would be only 4 formats, the main one being PDF.
CSV is not even on the radar…

This is all well and good, and CSV is indeed an improvement over PDFs (or SOAP), but between us, it is a really bad format.

  • It stops working well the moment you leave the English-speaking world (there is no standard Unicode support)
  • It is very easy to produce an invalid file (like all the CSV files our Ministry of Justice publishes)
  • There are no type definitions, so every file can use a different format for dates or numbers (see the sketch after this list)
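To make the typing problem concrete, here is a minimal sketch in Python (the data is made up): the csv module hands back raw strings, and nothing in the format says how to interpret them.

```python
import csv
import io

# A made-up CSV snippet: every value arrives as a plain string.
raw = "id,amount,date\n7,1.234,01/02/2015\n"

for row in csv.reader(io.StringIO(raw)):
    print(row)
# ['id', 'amount', 'date']
# ['7', '1.234', '01/02/2015']

# Is '1.234' the float 1.234 or the integer 1234 with a thousands
# separator? Is '01/02/2015' January 2nd or February 1st? The format
# itself gives no answer - every consumer has to guess.
```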

If PDF is a format that is pleasant for humans and terrible for machines, then CSV is a format that is mostly annoying for humans and machines alike.

Possible solutions can be found, among other places, in the work @pwalsh is doing at OKF with JSON Table Schema.
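For readers who have not seen it: a JSON Table Schema is a small JSON descriptor published alongside the CSV that declares each column's name and type. A minimal, hypothetical example (shown here as a Python dict; in practice it would be a schema.json file next to the data):

```python
# A hypothetical descriptor for a three-column CSV. The "fields" list is
# the core of the spec: one entry per column, with a name and a type.
schema = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "registered", "type": "date"},
    ]
}
```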

Adam

Hey @aaa,

I definitely don’t agree on CSV, but I admit I’m biased :).

  • Unicode - it is a non-issue. In Python 2.7 it sucks but is easy to work around. In Python 3 and Javascript I just don’t know of any issue there at all.
  • Definitely it is easy to create a bad file (invalid dimensions and so forth). But, we could say that about any human-editable format, no? That is why linting tools like this and this are popping up. At least in the UK, both these tools are in use as part of publishing workflows at national and regional levels.
  • Types - yes. But as you said, type-casting libs, like those I’ve done for OK, are popping up (and integrated into linting tools), based on specs like this and this (see the sketch after this list).
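As a rough illustration of what such type-casting libs do (plain Python, not the actual API of any of the OK libraries):

```python
import datetime

# Toy casters keyed by schema type - a simplified stand-in for what a
# real casting library does once a schema declares column types.
CASTERS = {
    "integer": int,
    "number": float,
    "date": lambda v: datetime.datetime.strptime(v, "%Y-%m-%d").date(),
    "string": str,
}

def cast_row(row, types):
    """Cast a list of CSV strings according to a list of schema types."""
    return [CASTERS[t](value) for value, t in zip(row, types)]

print(cast_row(["7", "1.25", "2015-06-01"], ["integer", "number", "date"]))
# [7, 1.25, datetime.date(2015, 6, 1)]
```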

Let’s not forget, though, why CSV is so good as a baseline format for open data:

  • You can stream it over http
  • You can iterate row-wise over the data
  • You can create it and read it on any consumer PC
  • You can import it directly into most databases (Postgres, BigQuery etc.)
  • You can build javascript visualizations directly off it
  • Pretty much every programming language can work with it
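The first two points are worth a concrete example: a sketch in Python (the URL is a placeholder) that streams a remote CSV row by row, without ever holding the whole file in memory.

```python
import csv
import requests

# Placeholder URL - substitute any published CSV.
url = "https://example.org/data.csv"

# stream=True keeps the HTTP response as an open stream; iter_lines()
# yields one decoded line at a time, and csv.reader happily consumes
# any iterable of lines - so the file is processed row-wise.
with requests.get(url, stream=True) as resp:
    resp.encoding = "utf-8"
    for row in csv.reader(resp.iter_lines(decode_unicode=True)):
        print(row)
```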

@pwalsh, as I said, CSV is not the worst format you could use.

However, it’s a bad format nonetheless.

The de-facto baseline for “what’s good for humans” is Microsoft Excel - you might not like it but it’s a fact.

  • unicode - I dare you to open a utf-8 encoded csv file with Excel
  • take for example this file which is published by our ministry of justice:
    http://index.justice.gov.il/DataGov/Corporations/Associations.csv
    one of the validators you linked to says it’s perfect, the other says it’s got hundreds of errors. it shouldn’t be ambiguous to validate a file format…
  • adding a schema on top of the CSV is nice, but then it’s not CSV any more - you might call it a new format, or just a bundle of files - still not usable by humans (you can’t really load them into Excel) and cumbersome for machines to use.
    And if we’re doing a new format, better to start from firmer foundations (e.g. JSON)

@aaa,

With the example file, Good Tables finds it structurally valid, and looking at it manually that seems about right. I guess when they wrote CSVLint, they didn’t consider unicode support in their code, but in GoodTables, I did :). So - this is an issue with the implementation, not the file format itself.

About Excel - to be totally clear, I have no problem with Excel whatsoever. I do not believe that it is right or reasonable to expect data to be produced outside of Excel in the vast majority of cases for many, many years to come. GoodTables actually supports all Excel formats in addition to CSV, but, as you may know, Excel is not exactly a walk in the park due to the date format and other quirks.

The point is though, that Excel, the program, reads and writes CSV - this is how spend data of UK ministries is produced, in Excel, and published as both Excel and CSV.

About formats, firmer foundations and so on. It is kind of like the arguments around Javascript in comparison to other programming languages: Let’s not let best be the enemy of good.

I think CSV will, for the foreseeable future, basically be to open data what JavaScript is to web programming: sure, it is not perfect, it is even weird sometimes, missing (supposedly) fundamental data types and so on - yet it basically has the minimum you need to get stuff done, and a growing ecosystem to fill the gaps.

The problem with the sample file is that someone decided to put a single " character inside a text field, which is illegal (although the file may seem mostly valid, it’s not).

As for Excel - the requirement should be to be able to open the published files.
Right now this is only possible in English-speaking countries, as Excel doesn’t support UTF-8 in CSV.

And as for best being the enemy of good - the same argument could have been used in the discussion of PDF vs CSV, or paper vs PDF etc. We’re not talking about developing something new or cutting-edge - there are better formats which are being widely used and industry accepted. Using CSV might have been nice when there were no such alternatives, now it’s just backwards.
I think that instead of going the extra work and linting their CSVs, they can simply use any CSV2JSON :smile: converter (such as these) and save us all the trouble.
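(A minimal converter really is just a few lines of Python; the file names here are placeholders:)

```python
import csv
import json

# Placeholder file names.
with open("input.csv", newline="", encoding="utf-8") as src:
    rows = list(csv.DictReader(src))   # each row becomes a dict of strings

with open("output.json", "w", encoding="utf-8") as dst:
    json.dump(rows, dst, ensure_ascii=False, indent=2)
```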

But, CSV to JSON doesn’t solve any problem whatsoever.

It will just give you json objects of strings, which is not useful, or, try to guess types, which is barely useful.

There are no consumer grade desktop apps that would let a user work with JSON, and streaming JSON is more difficult - there is newline-delimited JSON but try getting humans to edit one of those.
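(For reference, newline-delimited JSON is simply one complete JSON object per line, which is what makes it streamable - each line parses independently. A small sketch with made-up records:)

```python
import json

records = [{"id": 1, "name": "First"}, {"id": 2, "name": "Second"}]

# One complete JSON object per line - machines can stream it,
# but it is an unforgiving thing to hand-edit.
ndjson = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(ndjson)

for line in ndjson.splitlines():
    print(json.loads(line))
```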

Can you send me an example of an Excel file that you think cannot be opened? UTF-8 is one thing, but we can obviously pass specific encodings to a parser, or, use character detection libraries to guess.
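For example, guessing the encoding before parsing, with the chardet library (the file name is a placeholder):

```python
import chardet

# Placeholder file name.
with open("data.csv", "rb") as f:
    raw = f.read()

# detect() returns a guess plus a confidence score,
# e.g. {'encoding': 'windows-1255', 'confidence': 0.87, ...}
guess = chardet.detect(raw)
text = raw.decode(guess["encoding"])
```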

For best vs good - there are all sorts of formats - but a hell of a lot of open data is converging around csv now - for better or worse.

Oh, and about the quote symbol - it is legal in CSV. This file does not use quoting for enclosing fields, so that means that it is fine to use quote symbols in the field data.

This is the best reference for CSV formats, for what it is worth: http://tools.ietf.org/html/rfc4180
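To illustrate with Python’s csv module (the data is made up): a bare quote inside an unquoted field is treated as ordinary field data.

```python
import csv
import io

# Made-up data: the second field contains a bare " but the field itself
# is not quoted, so the parser keeps the quote as literal data.
data = 'name,nickname\nYossi,the "big" one\n'

for row in csv.reader(io.StringIO(data)):
    print(row)
# ['name', 'nickname']
# ['Yossi', 'the "big" one']
```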

I feel we’re spamming this thread, but let me just say that JSON has full unicode support and a single standard way to encode and validate - which is already a huge improvement over CSV. It also has streaming support, online editors and whatnot. It’s probably better supported by almost all programming languages.

Anyway, try to open this file in excel and lmk how it goes:
http://www.obudget.org/api/budget/00?o=csv

We can keep discussing this in private…

Hi,

Ok, sorry you think this is spamming. It is probably too technical now, but it does provide some context into why CSV is becoming the defacto for distribution of open data.

Especially because this is not the case in Israel, not in government and not in open data orgs like HaSadna, I thought it was relevant.

So, we can talk in private, but the last points I want to make here are:

  • Of course JSON is better on technical merit, but not for interop between humans and machines; that is the number one reason why we are using CSV, IMHO
  • I can read the file you provided in GoodTables, in Apple Numbers, and in Google Spreadsheets, but I don’t have Excel to check it. The thing is, though: if you know that Excel can’t open a UTF-8 file, and if that is a primary use case you want to support, why are you encoding in UTF-8 instead of some Windows encoding? You could even provide it as a flag on the API for an Excel-compatible CSV (see the sketch below). So, to me, this is a problem with the code implementation, not with the CSV format.
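For example (a sketch; the data is made up and the flag name is hypothetical): writing with the "utf-8-sig" codec prepends a byte-order mark, which current Excel versions usually take as a cue to read the file as UTF-8; cp1255 is the Hebrew Windows code page.

```python
import csv

rows = [["שם", "סכום"], ["דוגמה", "100"]]

# "utf-8-sig" = UTF-8 with a byte-order mark, the usual trick for making
# Excel recognise a UTF-8 CSV. Either this or encoding="cp1255" could
# sit behind a hypothetical ?o=excel flag on the API.
with open("budget_excel.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)
```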

you’re not spamming
reading carefully and enjoying it :slight_smile:
plus that’s what we’re here for - technical expertise

great, thanks @mary - i certainly hope this is useful information :slight_smile: