Tagged | big-data
-
Efficient Point in Polygon Joins via PySpark and BNG Geospatial Indexing
(databricks.com) -
How to ETL at Petabyte-Scale with Trino
(engineering.salesforce.com) -
Improving HDFS I/O Utilization for Efficiency
(eng.uber.com)#performance #distributed-systems #big-data #data-engineering
-
Efficiently Managing the Supply and Demand on Uber’s Big Data Platform
(eng.uber.com)#software-architecture #infra #distributed-systems #big-data
-
Cost-Efficient Open Source Big Data Platform at Uber
(eng.uber.com)#optimisation #distributed-systems #big-data #data-engineering
-
Challenges and Opportunities to Dramatically Reduce the Cost of Uber’s Big Data
(eng.uber.com) -
Evolution of search engines architecture - Algolia New Search Architecture Part 1
(highscalability.com) -
‘Orders Near You’ and User-Facing Analytics on Real-Time Geospatial Data
(eng.uber.com) -
Interactive Querying with Apache Spark SQL at Pinterest
(medium.com) -
Consolidating Facebook storage infrastructure with Tectonic file system
(engineering.fb.com) -
Optimizing Analytics Data Processing on eBay’s New Open-Source-Based Platform
(tech.ebayinc.com) -
The exabyte club: LinkedIn’s journey of scaling the Hadoop Distributed File System
(engineering.linkedin.com)#scaling #distributed-systems #analytics #big-data #data-engineering
-
Fraud Detection: Using Relational Graph Learning to Detect Collusion
(eng.uber.com) -
Introducing Orbit, An Open Source Package for Time Series Inference and Forecasting
(eng.uber.com) -
Fusing Elasticsearch with neural networks to identify data
(blog.twitter.com) -
The Journey of Corpus
(developers.soundcloud.com) -
Categorizing Products at Scale
(engineering.shopify.com) -
Presentation: Beyond the Distributed Monolith: Rearchitecting the Big Data Platform
(www.infoq.com) -
Merging Telemetry and Logs from Microservices at Scale with Apache Spark
(devblogs.nvidia.com) -
Learning Multi-dimensional indices: The next big thing in OLAP DBs
(towardsdatascience.com) -
Presentation: From Spark to Elasticsearch and Back - Learning Large-scale Models for Content Recommendation
(www.infoq.com) -
Presentation: Computational Propaganda - How Algorithms Influence our Decisions
(www.infoq.com) -
Architecture for High-Throughput Low-Latency Big Data Pipeline on Cloud
(towardsdatascience.com) -
Contextual Topic Identification
(blog.insightdatascience.com)#data-science #machine-learning #NLP #big-data #text-analysis
-
Supporting Spark as a First-Class Citizen in Yelp’s Computing Platform
(engineeringblog.yelp.com)#data-pipeline #distributed-systems #apache-spark #big-data #backend
-
The Causal Analysis of Cannibalization in Online Products
(codeascraft.com) -
Auto-Generated Knowledge Graphs
(towardsdatascience.com) -
Leveraging “spot” instances to drive down costs
(www.eventbrite.com) -
Deep Learning for Anomaly Detection
(blog.cloudera.com) -
Spotify Unwrapped: How we brought you a decade of data
(labs.spotify.com) -
Fanatics: Using Scylla for Online Order Capture
(www.scylladb.com) -
Infinite Storage in Confluent Platform
(www.confluent.io)#distributed-systems #apache-kafka #big-data #data-engineering
-
Bayesian Product Ranking at Wayfair
(tech.wayfair.com) -
Keeping LinkedIn professional by detecting and removing inappropriate profiles
(engineering.linkedin.com) -
For Your Ears Only: Personalizing Spotify Home with Machine Learning
(labs.spotify.com) -
Engineering SQL Support on Apache Pinot at Uber
(eng.uber.com) -
The Winding Road to Better Machine Learning Infrastructure Through Tensorflow Extended and Kubeflow
(labs.spotify.com) -
How Scylla Scaled to One Billion Rows a Second
(www.scylladb.com) -
Pretraining BERT with Layer-wise Adaptive Learning Rates
(devblogs.nvidia.com) -
Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations
(eng.uber.com) -
Our Transition to Machine Learning in Search Ranking to Match Customers and Professionals
(engineering.thumbtack.com) -
Powered by AI: Instagram’s Explore recommender system
(instagram-engineering.com) -
Large Graph Visualization Tools and Approaches
(towardsdatascience.com) -
Spotify’s Event Delivery – Life in the Cloud
(labs.spotify.com) -
Optimizing Search Index Generation using secondary cache
(medium.com)#performance #distributed-systems #big-data #caching #data-engineering
-
Interpretability in ML: Identifying anomalies, influencers, and root causes
(www.elastic.co) -
Griffin, an anti-fraud risk rule engine making billions of predictions daily
(engineering.grab.com)#data-science #software-engineering #software-architecture #algorithms #big-data
-
Semantics at Scale: BERT + Elasticsearch
(towardsdatascience.com) -
Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store
(towardsdatascience.com) -
Presto for ad hoc interactive Big Data Analytics at Salesforce
(engineering.salesforce.com) -
Searchable Ground Truth: Querying Uncommon Scenarios in Self-Driving Car Development
(eng.uber.com) -
Real-time experiment analytics at Pinterest using Apache Flink
(medium.com) -
What Makes Apache Flink Scale?
(medium.com) -
How Shopify Manages Petabyte Scale MySQL Backup and Restore
(engineering.shopify.com) -
MaRS: How Facebook keeps maps current and accurate
(engineering.fb.com) -
Shared Transactional Tables: The Foundation of Next Generation Big Data Warehousing
(blog.cloudera.com) -
Pilosa: A Scalable High Performance Bitmap Database Index
(hackernoon.com) -
How Map Matching Failures can be used for Map Making
(eng.lyft.com) -
PinText: A Multitask Text Embedding System in Pinterest
(medium.com) -
BIG Data, Fast Data - Part I
(www.thoughtworks.com) -
Adventures in big data wonderland: Going down the Pinterest Path
(medium.com) -
Pin2Interest: A scalable system for content classification
(medium.com) -
Labeling, transforming, and structuring training data sets for machine learning
(www.oreilly.com) -
Data Hub: A Generalized Metadata Search & Discovery Tool
(engineering.linkedin.com) -
Detecting and Preventing Abuse on LinkedIn Using Isolation Forests
(engineering.linkedin.com) -
Presentation: Tackling Computing Challenges @CERN
(www.infoq.com) -
Unifying visual embeddings for visual search at Pinterest
(medium.com)#machine-learning #image-processing #search #big-data #research
-
Code as Craft: Understand the role of Style in e-commerce shopping
(codeascraft.com) -
The science behind consolidating Answer Bot production Models: Part 1
(medium.com) -
Moving from Data-Driven to AI-Driven: The Next Step in the Evolution of Business Workflows
(multithreaded.stitchfix.com) -
Semantic Graphs
(blog.imaginea.com) -
Lynx: Identifying Wayfair Customers’ Functional Needs
(tech.wayfair.com) -
Give Me Jeans not Shoes: How BERT Helps Us Deliver What Clients Want
(multithreaded.stitchfix.com) -
Presto at Pinterest
(medium.com) -
Presentation: Reinforcement Learning: A Gentle Introduction with a Real Application
(www.infoq.com) -
Gaining Insights in a Simulated Marketplace with Machine Learning at Uber
(eng.uber.com) -
Presentation: Scaling Deep Learning to Petaflops and Beyond!
(www.infoq.com) -
Putting Machine Learning Models into Production
(blog.cloudera.com)#data-science #machine-learning #big-data #production #data-engineering
-
The quest for high-quality data
(www.oreilly.com) -
Presentation: Petastorm: A Light-Weight Approach to Building ML Pipelines
(www.infoq.com)#data-pipeline #machine-learning #big-data #data-engineering
-
Rethinking the Database Materialized View as an Index
(blog.timescale.com) -
Presentation: Applying Deep Learning to Airbnb Search
(www.infoq.com)#deep-learning #data-science #machine-learning #search #big-data
-
Modeling the Unseen
(tech.instacart.com) -
Log Compacted Topics in Apache Kafka
(towardsdatascience.com) -
Migrating a Big Data Environment to the Cloud, Part 4
(liveramp.com)#software-architecture #big-data #migration #cloud #data-engineering
-
Migrating a Big Data Environment to the Cloud, Part 3
(liveramp.com) -
A Richer Activity, Part 1
(medium.com) -
Presentation: Massive Scale Anomaly Detection Framework
(www.infoq.com) -
Introducing LINE Games analytics environment
(engineering.linecorp.com)#data-pipeline #software-architecture #big-data #data-engineering
-
Accelerating Machine Learning with the Feature Store Service
(technology.condenast.com) -
Hybrid Search: Building a textual and visual discovery experience at Pinterest
(medium.com) -
Deep Learning for Single Cell Biology
(towardsdatascience.com) -
Presentation: Forecasting in Complex Systems
(www.infoq.com) -
Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask
(eng.uber.com) -
Driving Business Decisions Using Data Science and Machine Learning
(engineering.linkedin.com) -
Evaluating the Unsupervised Learning of Disentangled Representations
(ai.googleblog.com) -
Consistent Data Partitioning through Global Indexing for Large Apache Hadoop Tables at Uber
(eng.uber.com) -
Extracting knowledge from knowledge graphs.
(towardsdatascience.com) -
Better to Give and to Receive: Alibaba’s Open-source Contributions to Flink
(hackernoon.com) -
How Bloomberg Tracks Hundreds of Billions of Data Points Daily with MetricTank and Grafana
(grafana.com) -
How eBay Governs its Big Data Fabric
(www.ebayinc.com) -
Presentation: Fairness, Transparency, and Privacy in AI @LinkedIn
(www.infoq.com) -
Uber Case Study: Choosing the Right HDFS File Format for Your Apache Spark Jobs
(eng.uber.com) -
Pro Tips: How Booking.com Handles Millions of Metrics Per Second with Graphite
(grafana.com) -
Lessons from our journey to enable global code search with Elasticsearch on GitLab.com
(about.gitlab.com) -
Solving Big Data Challenges with Data Science at Uber
(eng.uber.com)#DBMS #scaling #distributed-systems #big-data #data-engineering
-
Presentation: Would You Have Clicked on What We Would Have Recommended?
(www.infoq.com) -
Tackling Bias in Machine Learning
(blog.insightdatascience.com) -
Productionizing ML with workflows at Twitter
(blog.twitter.com)#software-engineering #software-architecture #machine-learning #big-data #production
-
Transparent Hierarchical Storage Management with Apache Kudu and Impala
(blog.cloudera.com) -
Rendezvous Architecture for Data Science in Production
(towardsdatascience.com)#data-science #software-architecture #DBMS #distributed-systems #big-data
-
Managing Uber’s Data Workflows at Scale
(eng.uber.com)#data-pipeline #DBMS #scaling #distributed-systems #big-data
-
How Netflix Uses AI and Machine Learning
(becominghuman.ai) -
Three Principles of Data Warehouse Development
(www.toptal.com) -
3 reasons to add deep learning to your time series toolkit
(www.oreilly.com) -
Lambda architecture— how to build a Big data pipeline part 1
(towardsdatascience.com) -
Understanding Supply & Demand in Ride-hailing Through the Lens of Data
(engineering.grab.com) -
Complementary Item Recommendations at eBay Scale
(www.ebayinc.com)#data-pipeline #software-architecture #machine-learning #big-data
-
Learning Hiring Preferences: The AI Behind LinkedIn Jobs
(engineering.linkedin.com) -
Contextualizing Airbnb by Building Knowledge Graph
(medium.com) -
Understanding Customer Churning with Big Data Analytics
(towardsdatascience.com) -
Big Data Metrics Discovery
(engineering.salesforce.com)#software-architecture #distributed-systems #big-data #backend
-
Explainable Reasoning over Knowledge Graphs for Recommendation
(www.ebayinc.com) -
Interactive Visual Search
(www.ebayinc.com)#machine-learning #image-processing #algorithms #search #big-data
-
Introducing Feast
(towardsdatascience.com) -
Keeping It Classy: How Quizlet uses hierarchical classification to label content with academic…
(towardsdatascience.com) -
Why we've chosen Snowflake ❄️ as our Data Warehouse
(drivy.engineering) -
A Deep Dive Into Data Quality
(towardsdatascience.com) -
Presentation: Designing Automated Pipelines for Unseen Custom Data
(www.infoq.com) -
Presentation: Nearline Recommendations for Active Communities @LinkedIn
(www.infoq.com) -
Generating Twitter Ego-Networks & Detecting Ego-Communities
(towardsdatascience.com)#data-analytics #big-data #graph-processing #visualisation #social-networks
-
Using Economic Graph Data to Power the LinkedIn Salary Product
(engineering.linkedin.com) -
HyperLogLog in Presto: A significantly faster way to handle cardinality estimation
(code.fb.com) -
Implementing the Netflix Media Database
(medium.com) -
Boosting Big Data workloads with Presto Auto Scaling
(www.eventbrite.com) -
TagOverflow — Correlating Tags in Stackoverflow
(towardsdatascience.com) -
Providing Metadata Discovery on Large-Volume Data Sets
(www.ebayinc.com) -
Predicting real-time availability of 200 million grocery items in US/Canada stores
(tech.instacart.com) -
Tag-based Navigation of a Fashion Catalog
(jobs.zalando.com) -
Seven Tips for Visual Search at Scale
(www.ebayinc.com) -
Splitting Stateful Services across Continents at Instagram
(www.infoq.com) -
The Best Data Visualizations for Grabbing Readers’ Attention
(hackernoon.com) -
New fastMRI open source AI research tools from Facebook and NYU School of Medicine
(code.fb.com) -
The Fundamental Problem of Search
(www.eventbrite.com) -
Handling Imbalanced Datasets in Deep Learning
(towardsdatascience.com) -
Presentation: Big Data and Deep Learning: A Tale of Two Systems
(www.infoq.com) -
boundary-layer : Declarative Airflow Workflows
(codeascraft.com) -
Druid @ Airbnb Data Platform
(medium.com)#data-pipeline #software-architecture #analytics #big-data #druid
-
Netflix MediaDatabase — Media Timeline Data Model
(medium.com) -
Splitting Millions of Source Code Identifiers with Deep Learning
(blog.sourced.tech) -
ModaNet: A Large-scale Street Fashion Dataset with Polygon Annotations
(www.ebayinc.com) -
Horizon: The first open source reinforcement learning platform for large-scale products and services
(code.fb.com) -
Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads
(eng.uber.com)#data-pipeline #software-architecture #distributed-systems #big-data #backend
-
Uber Introduces PyML: Their Secret Weapon for Rapid Machine Learning Development
(towardsdatascience.com) -
Turnilo — let’s change the way people explore Big Data
(allegro.tech) -
Uber’s Big Data Platform: 100+ Petabytes with Minute Latency
(eng.uber.com)#software-architecture #distributed-systems #big-data #systems
-
Using machine learning to index text from billions of images
(blogs.dropbox.com) -
An Introduction to AI at LinkedIn
(engineering.linkedin.com) -
Managing data store locality at scale with Akkio
(code.fb.com) -
Building Google Dataset Search and Fostering an Open Data Ecosystem
(ai.googleblog.com) -
Architecture of Nautilus, the new Dropbox search engine
(blogs.dropbox.com)#software-architecture #search #scaling #big-data #filesystem
-
Introducing Petastorm: Uber ATG’s Data Access Library for Deep Learning
(eng.uber.com)#data-pipeline #deep-learning #software-architecture #big-data
-
Big Data Governance: Hive Metastore Listener for Apache Atlas Use Cases
(www.ebayinc.com) -
Open Sourcing TonY: Native Support of TensorFlow on Hadoop
(engineering.linkedin.com) -
Introducing Oak: an Open Source Scalable Key-Value Map for Big Data Analytics
(yahooeng.tumblr.com) -
Marmaray: An Open Source Generic Data Ingestion and Dispersal Framework and Library for Apache Hadoop
(ubereng.wpengine.com) -
Progress for big data in Kubernetes
(www.oreilly.com) -
Scaling neural machine translation to bigger data sets with faster training and inference
(code.fb.com) -
Data Wrangling with Apache Kafka and KSQL
(www.confluent.io) -
Data Warehousing and ETLs
(medium.com) -
An introduction to Druid, your Interactive Analytics at (big) Scale
(towardsdatascience.com) -
Unriddling Big Data file formats
(www.thoughtworks.com) -
Airflow, Meta Data Engineering, and a Data Platform for the World’s Largest Democracy
(hackernoon.com) -
Leveraging Elastic Demand for Forecasting
(tech.instacart.com) -
From Big Data to micro-services: how to serve Spark-trained models through AWS lambdas
(towardsdatascience.com) -
Parallelizing Feature Engineering with Dask
(towardsdatascience.com) -
Learning Market Dynamics for Optimal Pricing
(medium.com) -
Developing a Bioinformatics Database for Disulfide Bonds Research
(www.toptal.com) -
How we build a robust analytics platform using Spark, Kafka and Cassandra Lambda architecture
(medium.com) -
Databook: Turning Big Data into Knowledge with Metadata at Uber
(eng.uber.com) -
Comparing Billions of Rows per Day
(segment.com) -
Building a Graph Data Pipeline With Zeppelin Spark and Neo4j
(towardsdatascience.com) -
FastText: Under the Hood
(towardsdatascience.com) -
Distributed graphs processing with Spark GraphX
(hackernoon.com) -
Migrating Messenger storage to optimize performance
(code.fb.com) -
H3: Uber’s Hexagonal Hierarchical Spatial Index
(eng.uber.com) -
Presentation: The Future of Distributed Databases Is Relational
(www.infoq.com) -
Presentation: Simplifying ML Workflows with Apache Beam
(www.infoq.com) -
Presentation: Gimel: PayPal’s Analytics Data Platform
(www.infoq.com) -
Introducing Commute Time for Jobs
(engineering.linkedin.com) -
Solr: Improving performance for Batch Indexing
(blog.box.com) -
How Microservices Could Save Medical IoT
(nordicapis.com) -
Apache Spark - Performance
(blog.scottlogic.com) -
Data Retrieval and Cleaning: Tracking Migratory Patterns
(www.dataquest.io) -
Centrifuge: a reliable system for delivering billions of events per day
(segment.com) -
Structure & Attribute Based Graph Partitioning
(medium.com) -
Looking under the hood of the Eventbrite data pipeline!
(www.eventbrite.com) -
Exploring The GitHub Archive
(blog.wallaroolabs.com) -
Balanced Partitioning and Hierarchical Clustering at Scale
(ai.googleblog.com) -
Utilizing MapReduce Combiners and HyperLogLog++ to process millions of queries over datasets with billions of records
(liveramp.com) -
Then and Now: The Rethinking of Time Series Data at Wayfair
(tech.wayfair.com) -
Introducing Semantic Experiences with Talk to Books and Semantris
(research.googleblog.com) -
Simon Moss on using artificial intelligence to fight financial crimes
(www.oreilly.com) -
Give Meaning to 100 billion Events a Day - The Analytics Pipeline at Teads
(highscalability.com) -
Scaling Uber’s Hadoop Distributed File System for Growth
(eng.uber.com) -
Extracting Signals From the News
(eng.datafox.com) -
A brief introduction to two data processing architectures — Lambda and Kappa for Big Data
(towardsdatascience.com) -
Search Federation Architecture at LinkedIn
(engineering.linkedin.com) -
The Evolution of Data at Reddit
(redditblog.com) -
Data Analysis with Spark
(jobs.zalando.com) -
Under the hood: Suicide prevention tools powered by AI
(code.facebook.com) -
A Cornucopia of Area Rugs: Will a Diverse Set of Choices Help Customers Find More of What They Love?
(tech.wayfair.com) -
How to hack Spark to do some data lineage
(blog.octo.com) -
Creating a musical (data) pipeline
(devblog.songkick.com) -
Mis-employing radar charts to distinguish multidimensional data
(towardsdatascience.com) -
How to add full text search to your website
(medium.com) -
Cross-Lingual End-to-End Product Search with Deep Learning
(jobs.zalando.com) -
Dynamometer: Scale Testing HDFS on Minimal Hardware with Maximum Fidelity
(engineering.linkedin.com) -
From big data to fast data
(www.oreilly.com) -
Using Synthetic Data Modeling to Enhance Machine Learning
(engineering.salesforce.com) -
Caviar’s Word2Vec Tagging For Menu Item Recommendations
(medium.com) -
Time Series Forecasting with Splunk. Part I. Intro & Kalman Filter.
(towardsdatascience.com) -
Scaling Time Series Data Storage — Part I
(medium.com) -
PageRank in Spark
(developers.soundcloud.com) -
Omphalos, Uber’s Parallel and Language-Extensible Time Series Backtesting Tool
(eng.uber.com) -
Fishing for graphs in a Hadoop data lake
(www.oreilly.com) -
Mapping Medium’s Tags
(medium.engineering) -
Faster E-commerce Search
(www.ebayinc.com) -
The frequency of tags on Stack Overflow
(towardsdatascience.com) -
Evolving search recommendations on Pinterest
(medium.com) -
The Art of Effective Visualization of Multi-dimensional Data
(towardsdatascience.com) -
Bad Design Is Bad for Your Health: Why Data Visualization Details Matter
(engineering.cerner.com) -
Big Data: Information visualization techniques
(towardsdatascience.com) -
Out of Core Genomics
(towardsdatascience.com) -
Large-Scale Health Data Analytics with OHDSI
(blog.cloudera.com) -
How machine learning will accelerate data management systems
(www.oreilly.com) -
Hadoop Delegation Tokens Explained
(blog.cloudera.com) -
DeepVariant: Highly Accurate Genomes With Deep Neural Networks
(research.googleblog.com) -
[Episode 01] Airbnb, Machine Learning & the Future of Travel
(mesosphere.com) -
Incremental Data Capture for Oracle Databases at LinkedIn: Then and Now
(engineering.linkedin.com) -
Dali Views: Functions as a Service for Big Data
(engineering.linkedin.com) -
Rebuilding the Segment Leaderboards Infrastructure — Part 3: Design of the New System
(medium.com)#stream-processing #apache-kafka #big-data #backend #cassandra
-
The Global Heatmap, Now 6x Hotter
(medium.com) -
Big Data Processing at Spotify: The Road to Scio (Part 1)
(labs.spotify.com) -
Airflow: The Missing Context
(hackernoon.com) -
Using Kafka Streams API for predictive budgeting
(medium.com) -
Big Dataset: All Reddit Comments – Analyzing with ClickHouse
(www.percona.com) -
One Million Tables in MySQL 8.0
(www.percona.com) -
The Search for Better Search at Reddit - Because, certainly, we’ve solved it this time
(redditblog.com) -
Exploring and Visualizing an Open Global Dataset
(research.googleblog.com) -
Steering oceans of content to the world
(code.facebook.com) -
IMDb Data in a Graph Database
(www.percona.com) -
Implementing Temporal Graphs with Apache TinkerPop and HGraphDB
(blog.cloudera.com) -
Breaking the “curse of dimensionality” in Genomics using “wide” Random Forests
(databricks.com) -
Building the Activity Graph, Part 2
(engineering.linkedin.com) -
BigDB - an ad data pipeline for LINE
(engineering.linecorp.com) -
Engineering Data Analytics with Presto and Parquet at Uber
(eng.uber.com)