Risks of using Facebook Marketplace (or similar) for annotated data
- Terms of Service and scraping bans: Platforms generally forbid automated scraping and reuse of content. Violating those terms can get your account blocked or expose you to legal risk.
- Copyright and ownership: Sellers own the photos they post, so you need their permission to reuse them. Public posting ≠ permission for dataset use.
- Privacy and personal data: Marketplace posts often contain identifying information (faces, names, addresses, phone numbers). Using such data for ML can violate data-protection laws (GDPR, CCPA) unless you obtain consent and apply appropriate safeguards.
- Consent and human subjects: Even if images are “public,” using them for research or commercial ML without consent can be unethical and legally risky.
- Bias and label quality: Marketplace images aren’t a representative sample for many tasks, and labels derived from seller text are noisy.
- Delivery and scaling problems: Marketplace isn’t designed as a data source, so collecting, cleaning, and maintaining data at scale is clumsy.
An ethical and legally safer route (if you still want to use a marketplace)
If you must use Marketplace images (e.g., for a narrow product-recognition dataset), follow this strict process:
1) Don’t scrape. Use manual / permission-based collection.
- Do not use automated scraping bots (platform ToS risk). Instead, contact sellers manually and ask permission to use their images for your dataset.
- Use a documented consent workflow (signed or written acceptance).
2) Ask for explicit written consent & license
- Obtain a short written license from each contributor giving you the right to use, store, and distribute the images for the purposes you specify (training, research, public release, or internal use).
- Give them an opt-out route and explain how their images will be anonymized.
Sample short message to sellers
Hi — I’m building a dataset for a product-recognition research project and I’m requesting permission to use the photos from your listing #[listing id]. I will only use the images for [research/commercial/internal] purposes, will remove or blur any personal identifiers (names/phone numbers/addresses), and will compensate you [$X]. Do you agree to grant a non-exclusive license to use these images under these terms? If yes, please reply “I consent” and I’ll send a short release form. Thank you!
Sample minimal consent clause (for a signed message/email)
I, [name], grant [your company/researcher name] a non-exclusive, royalty-free, worldwide license to use, copy, modify, and distribute the images I provided for the purpose of training and evaluating machine learning models. I confirm I own the copyright or have permission to license the images. I consent to anonymization (blurring of identifying text/faces) as needed. — [name, date]
3) Anonymize and minimize personal data
- Blur faces, license plates, names, phone numbers, addresses before storing/annotating.
- Strip EXIF metadata (GPS, device owner, etc.) automatically.
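As a rough illustration of the metadata-stripping step, the sketch below removes APP1 segments (where Exif data, including GPS tags, is stored) from a JPEG using only the Python standard library. The function name `strip_app1_segments` is hypothetical, and the parser assumes a well-formed file with no padding bytes between segments. Other metadata containers (for example IPTC data in APP13) are not handled, so a maintained imaging library is probably the better choice in production.

```python
import struct

SOI = b"\xff\xd8"   # start-of-image marker
APP1 = 0xE1         # segment type that carries Exif (and XMP) metadata
SOS = 0xDA          # start-of-scan: compressed image data follows


def strip_app1_segments(jpeg: bytes) -> bytes:
    """Return a copy of the JPEG bytes with every APP1 segment dropped.

    Hypothetical helper for illustration; assumes a well-formed JPEG.
    """
    if jpeg[:2] != SOI:
        raise ValueError("not a JPEG: missing SOI marker")

    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = jpeg[pos + 1]
        if marker == SOS:
            # Everything from here to the end is image data; copy it as-is.
            out += jpeg[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        if marker != APP1:
            out += jpeg[pos:pos + 2 + length]
        pos += 2 + length

    out += jpeg[pos:]
    return bytes(out)
```

A typical use would be to read each uploaded file as bytes, pass it through `strip_app1_segments`, and store only the cleaned bytes.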
4) Log provenance & consent
- Keep a spreadsheet linking each image to the consent record and any payment you gave. This is critical for audits or takedown requests.
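One possible shape for such a log is a CSV file with one row per image. The sketch below is only an assumption about what is worth recording: the `ConsentRecord` fields, the helper names, and the use of a SHA-256 hash as the image identifier are all hypothetical choices, not a required schema.

```python
import csv
import hashlib
from dataclasses import asdict, dataclass, fields


@dataclass
class ConsentRecord:
    """One row of the provenance log (hypothetical schema)."""
    image_sha256: str   # fingerprint of the stored, anonymized image
    listing_id: str     # marketplace listing the image came from
    contributor: str    # who granted the license
    consent_date: str   # ISO 8601 date of the written consent
    consent_ref: str    # where the consent message or form is archived
    payment_usd: float  # compensation paid, 0.0 if none


def image_fingerprint(image_bytes: bytes) -> str:
    """Stable identifier that survives file renames."""
    return hashlib.sha256(image_bytes).hexdigest()


def write_log(records, fileobj) -> None:
    """Write all records as CSV with a header row."""
    names = [f.name for f in fields(ConsentRecord)]
    writer = csv.DictWriter(fileobj, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def images_for_contributor(fileobj, contributor: str) -> list:
    """Look up every image hash tied to one contributor (for takedowns)."""
    return [
        row["image_sha256"]
        for row in csv.DictReader(fileobj)
        if row["contributor"] == contributor
    ]
```

When a contributor withdraws consent, `images_for_contributor` returns the hashes to delete from storage and from any exported dataset versions.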
5) Compensate fairly
- Offer a modest payment or discount in exchange for sellers’ consent. It is fair, and it makes people more likely to agree.
6) Use an annotation platform
- Import the anonymized images into a proper annotation tool (CVAT, Labelbox, Supervisely, LabelImg) and use trained annotators or quality-control loops.
7) Legal check
- Before release or commercial use, have legal counsel review your license and process, particularly if faces or private information appear in the images.
Practical alternative sources (recommended)
These options avoid the problems above and are generally faster, safer, and often cheaper:
Open / curated image datasets
- COCO, Open Images, ImageNet (various licenses): good for object detection/classification tasks.
- Domain-specific public datasets (furniture, appliances, vehicles): search Kaggle or Hugging Face Datasets.
Paid data vendors and annotation services
- Scale AI, Labelbox, SuperAnnotate, Appen: offer data collection and annotation at scale, with compliance processes built in.
Crowdsourced collection with consent
- Use Mechanical Turk / Prolific / Appen to have workers capture and upload photos under your consent form (you control instructions and metadata). This gives you documented consent for every image and more consistent data.
Synthetic data
- Tools like BlenderProc, Unity Perception, or synthetic-image providers can generate images whose labels are exact by construction, which works for many vision tasks.
Quick recommended workflow (safe & scalable)
- Define dataset spec: classes, resolution, annotation type, quantity.
- Choose source: prefer public datasets; otherwise crowdsource (with consent).
- Collect with consent: use a web form or worker platform to gather images + consent.
- Anonymize & strip metadata.
- Annotate in tool: label with a workflow that includes QA (2+ annotators + adjudication).
- Store & document: licenses, provenance, demographics, and README.
- Legal review before release.
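The QA step in the workflow above (2+ annotators plus adjudication) can be sketched as a simple agreement check. This is a minimal illustration, not a feature of any particular annotation tool: the function name `resolve_labels`, the `min_agreement` threshold, and the input layout (image id mapped to a list of labels) are all assumptions.

```python
from collections import Counter


def resolve_labels(annotations, min_agreement=1.0):
    """Split images into accepted labels and an adjudication queue.

    annotations: dict mapping image id -> list of labels, one per annotator.
    min_agreement: fraction of annotators that must agree (1.0 = unanimous).
    Returns (accepted, needs_adjudication).
    """
    accepted = {}
    needs_adjudication = []
    for image_id, labels in annotations.items():
        # A single annotation cannot be cross-checked, so escalate it.
        if len(labels) < 2:
            needs_adjudication.append(image_id)
            continue
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[image_id] = top_label
        else:
            needs_adjudication.append(image_id)
    return accepted, needs_adjudication
```

Images in the adjudication queue go to a senior annotator for a final decision. Lowering `min_agreement` (for example to 2/3 with three annotators) trades label quality for a shorter queue.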