# Handling Big Data Challenges: A Case Study of AllFreeNovel.cc

## Technical Challenges & Solutions
### 1. Data Ingestion Bottlenecks
Problem:
Daily ingestion of 50,000+ new chapters from multiple sources (CN/JP/KR) with varying formats:
- XML feeds from Korean publishers
- JSON APIs from Chinese platforms
- Raw text dumps from Japanese partners
Solution:
```python
# Distributed ETL pipeline: stream chapters from each source,
# normalize them, and publish Avro-encoded records to Kafka.
class ChapterIngestor:
    def __init__(self, producer, schema_registry):
        self.kafka_topic = "raw-chapters"
        self.producer = producer                # async Kafka producer client
        self.schema_registry = schema_registry  # maps source format -> Avro schema

    async def process(self, source):
        # Each source yields raw chunks (XML, JSON, or plain text).
        async for chunk in source.stream():
            normalized = await self._normalize(chunk)
            await self.producer.produce(
                self.kafka_topic,
                value=normalized,
                schema=self.schema_registry.get(source.format),
            )
```
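The `_normalize` step is where the per-source format differences are absorbed. A minimal sketch of such a dispatcher, assuming a canonical `{title, body}` chapter shape (the field names and fallback rules are illustrative, not the production mapping):

```python
import json
import xml.etree.ElementTree as ET

def normalize_chunk(chunk: str, fmt: str) -> dict:
    """Convert one raw chunk into a canonical chapter dict."""
    if fmt == "xml":  # Korean publisher feeds
        root = ET.fromstring(chunk)
        return {
            "title": root.findtext("title", default=""),
            "body": root.findtext("body", default=""),
        }
    if fmt == "json":  # Chinese platform APIs
        doc = json.loads(chunk)
        return {"title": doc.get("title", ""), "body": doc.get("content", "")}
    # Raw text dumps (Japanese partners): treat the first line as the title.
    title, _, body = chunk.partition("\n")
    return {"title": title.strip(), "body": body.strip()}
```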
### 2. Search Performance Optimization
Metrics Before Optimization:
- 1200ms average query latency
- 78% cache miss rate
- 12-node Elasticsearch cluster at 85% load
Implemented Solutions:
- Hybrid Index Strategy (a tier-routing sketch follows this list):
  - Hot data (latest chapters): in-memory RedisSearch
  - Warm data: Elasticsearch with a custom tokenizer
  - Cold data: ClickHouse columnar storage
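Routing writes to the right tier can key off chapter age alone. A minimal sketch, assuming illustrative cutoffs (the 7-day and 180-day windows are not from the source):

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=7)     # assumed cutoff for "latest chapters"
WARM_WINDOW = timedelta(days=180)  # assumed cutoff before cold archival

def storage_tier(published_at: datetime) -> str:
    """Pick an index tier for a chapter based on its age."""
    age = datetime.now(timezone.utc) - published_at
    if age <= HOT_WINDOW:
        return "redisearch"     # in-memory, lowest latency
    if age <= WARM_WINDOW:
        return "elasticsearch"  # full-text search, custom tokenizer
    return "clickhouse"         # columnar, cheapest at rest
```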
Query Pipeline:
```mermaid
graph TD
    A[User Query] --> B{Query Type?}
    B -->|Simple| C[RedisSearch]
    B -->|Complex| D[Elasticsearch]
    B -->|Analytics| E[ClickHouse]
    C --> F[Result Blender]
    D --> F
    E --> F
    F --> G[Response]
```
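On the read path, each query is classified, dispatched to one backend, and its hits are blended and deduplicated. A hedged sketch of that dispatch (the classification heuristics and backend client interface are stand-ins, not the production rules):

```python
async def search(query: str, backends: dict) -> list:
    """Route a query to an index tier, then blend the results.

    `backends` maps tier name -> async client with a `.query()` method;
    the rules below are illustrative, not the production classifier.
    """
    if query.startswith("stats:"):   # analytics-style query
        hits = await backends["clickhouse"].query(query)
    elif len(query.split()) <= 2:    # short lookups served from memory
        hits = await backends["redisearch"].query(query)
    else:                            # complex full-text queries
        hits = await backends["elasticsearch"].query(query)
    # Result blender: dedupe by document id, keeping the best score.
    best = {}
    for hit in hits:
        if hit["id"] not in best or hit["score"] > best[hit["id"]]["score"]:
            best[hit["id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)
```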
### 3. Real-time Recommendations
Challenge:
Generate personalized suggestions for 2M+ DAU with <100ms latency
ML Serving Architecture:
```text
┌──────────────┐      ┌─────────────┐
│ Feature Store│◄─────│  Flink Jobs │
└──────┬───────┘      └─────────────┘
       │
┌──────▼───────┐      ┌─────────────┐
│ Model Cache  │─────►│    ONNX     │
└──────┬───────┘      │   Runtime   │
       │              └─────────────┘
┌──────▼───────┐
│     User     │
│ Interactions │
└──────────────┘
```
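Hitting the latency budget depends on loading the exported model once and reusing the inference session across requests. A minimal ONNX Runtime sketch (the model path, input layout, and feature shape are assumptions):

```python
import numpy as np
import onnxruntime as ort

# Load the exported model once at startup; InferenceSession is
# reusable across requests, which keeps per-call overhead low.
session = ort.InferenceSession("recommender.onnx")  # hypothetical path
input_name = session.get_inputs()[0].name

def score_candidates(user_features: np.ndarray) -> np.ndarray:
    """Score candidate items for one user.

    Assumes a float32 matrix of shape (n_candidates, n_features)
    built from the Feature Store; the real layout may differ.
    """
    outputs = session.run(None, {input_name: user_features.astype(np.float32)})
    return outputs[0]
```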
Results:
- P99 latency reduced from 2200ms → 89ms
- Recommendation CTR increased by 37%
- Monthly infrastructure cost savings: $28,500
## Key Takeaways
- Data Tiering is crucial for cost-performance balance
- Asynchronous Processing prevents pipeline backpressure
- Hybrid Indexing enables optimal query performance
- Model Optimization (ONNX conversion) dramatically improves ML serving